Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 11, 2011

Uncovering mysteries of InputFormat (Hadoop)

Filed under: Hadoop,MapReduce — Patrick Durusau @ 7:39 pm

Uncovering mysteries of InputFormat: Providing better control for your Map Reduce execution, by Boris Lublinsky and Mike Segel.

From the post:

As more companies adopt Hadoop, there is a greater variety in the types of problems for which Hadoop’s framework is being utilized. As the various scenarios where Hadoop is applied grow, it becomes critical to control how and where map tasks are executed. One key to such control is custom InputFormat implementation.

The InputFormat class is one of the fundamental classes in the Hadoop Map Reduce framework. This class is responsible for defining two main things:

  • Data splits
  • Record reader

Data split is a fundamental concept in Hadoop Map Reduce framework which defines both the size of individual Map tasks and its potential execution server. The Record Reader is responsible for actual reading records from the input file and submitting them (as key/value pairs) to the mapper. There are quite a few publications on how to implement a custom Record Reader (see, for example, [1]), but the information on splits is very sketchy. Here we will explain what a split is and how to implement custom splits for specific purposes.

See the post for the details.
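To give a flavor of where those two responsibilities live in code, here is a minimal sketch of my own (not the authors’ code) of a custom InputFormat that changes only the split behavior, forcing one map task per file, while reusing Hadoop’s stock line-oriented record reader. The class name is invented for the example.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

/**
 * Deliberately simple custom InputFormat: it keeps the default
 * line-oriented record reader but overrides the split logic so that
 * each input file becomes exactly one split, and therefore one map
 * task, scheduled near the nodes holding that file's blocks.
 */
public class WholeFileTextInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Returning false tells getSplits() not to carve the file into
        // block-sized pieces; one file == one map task.
        return false;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // Reuse Hadoop's standard line reader; a custom RecordReader
        // would go here if the records were not lines.
        return new LineRecordReader();
    }
}
```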

Something for you to explore over the weekend!

Postgres Plus Connector for Hadoop

Filed under: Hadoop,MapReduce,Pig,PostgreSQL,SQL — Patrick Durusau @ 7:39 pm

Postgres Plus Connector for Hadoop

From the webpage:

The Postgres Plus Connector for Hadoop provides developers easy access to massive amounts of SQL data for integration with or analysis in Hadoop processing clusters. Now large amounts of data managed by PostgreSQL or Postgres Plus Advanced Server can be accessed by Hadoop for analysis and manipulation using Map-Reduce constructs.

EnterpriseDB recognized early on that Hadoop, a framework allowing distributed processing of large data sets across computer clusters using a simple programming model, was a valuable and complementary data processing model to traditional SQL systems. Map-Reduce processing serves important needs for basic processing of extremely large amounts of data and SQL based systems will continue to fulfill their mission critical needs for complex processing of data well into the future. What was missing was an easy way for developers to access and move data between the two environments.

EnterpriseDB has created the Postgres Plus Connector for Hadoop by extending the Pig platform (an engine for executing data flows in parallel on Hadoop) and using an EnterpriseDB JDBC driver to allow users the ability to load the results of a SQL query into Hadoop where developers can operate on that data using familiar Map-Reduce programming. In addition, data from Hadoop can also be moved back into PostgreSQL or Postgres Plus Advanced Server tables.

A private beta is in progress, see the webpage for details and to register.

Plus, there is a webinar, Extending SQL Analysis with the Postgres Plus Connector for Hadoop, on Tuesday, November 29, 2011, at 11:00 am Eastern Standard Time (New York, GMT-05:00). Registration is at the webpage as well.

A step towards seamless data environments. Much like word processing today compared to the days of “.” commands: mostly the same commands, just hidden from view. Data is headed in the same direction. You will specify the desired results and the environment will take care of access, processor(s), operations and the like. Tables will appear as tables because you have chosen to view them as tables, and so on.

From Småland’s Woods to Silicon Valley

Filed under: Jobs,Topic Maps — Patrick Durusau @ 7:39 pm

From Småland’s Woods to Silicon Valley

Peter Neubauer (think www.neo4j.org, www.ops4j.org and www.qi4j.org) on entrepreneurship.

From the description:

A company is like a baby. And it takes as long to allow it to grow. Don’t fool yourself and be prepared for a journey from Påskallavik to Menlo Park. It takes a village to raise a child, and a community to grow a company.

The scenes from some of the slides beg for further explanation. 😉

But it is a useful slide deck for anyone who wants to form a successful company. It is easy to form the other kind; no instructions are needed.

Peter concludes with pointers to a number of resources that you will find useful in your journey to a successful company.

Enjoy!

PS: One resource Peter points to is 5 Things to do when you’re unemployed. Hint: It’s not job hunting, by Penelope Trunk. Good advice and highly amusing. Penelope’s blog offers “Advice at the intersection of work and life.” (Those are different?) Anyway, when you are not writing, running, or breathing topic maps, you will enjoy her blog.

Are You a Cassandra Jedi?

Filed under: Cassandra,Conferences,NoSQL — Patrick Durusau @ 7:38 pm

Are You a Cassandra Jedi?

Cassandra Conference, December 6, 2011, New York City

From the call for speakers:

BURLINGAME, Calif. – November 9, 2011 – DataStax, the commercial leader in Apache Cassandra™, along with the NYC Cassandra User Group, NoSQL NYC, and Big Data NYC are joining together to present the first Cassandra New York City conference on December 6. This all day, two-track event will focus on enterprise use cases as well as the latest developments in Cassandra. Early bird registration is now open here.

Coming on the heels of a sold-out DataStax Cassandra SF earlier this year, the event will feature some of the most interesting Cassandra use-cases from up and down the Eastern Seaboard. Cassandra NYC will be keynoted by Jonathan Ellis, chairman of the Apache Cassandra project, who will highlight what’s new in Cassandra 1.0, and what’s in store for the future. Additional confirmed speakers include Nathan Marz, lead engineer for the Storm project at Twitter and Jim Ancona, systems architect at Constant Contact.

“With the recent 1.0 release, we are seeing users doing amazing new things with Cassandra that are going beyond even our expectations and imagination,” said Ellis. “We look forward to sharing these stories with the broader community, to further hasten the adoption and usage of Cassandra to meet their real-time, big data challenges.”

Call for Speakers and Press Registration

The call for speakers is now also open for the event. Submissions can be made to lynnbender@datastax.com.

Press interested in attending the event may contact Zenobia@intersectcom.com for a complimentary press pass.

The event will be held at the Lighthouse International Conference Center on 59th St.

I am not sure about “early bird” registration for an event less than a month away but this sounds quite interesting. I hope the presentations will be recorded and posted for asynchronous access.

DataStax Enterprise and DataStax Community Edition

Filed under: Cassandra,DataStax,NoSQL — Patrick Durusau @ 7:38 pm

DataStax Enterprise and DataStax Community Edition

From the announcement:

BURLINGAME, Calif. – Nov. 1, 2011 – DataStax, the commercial leader in Apache Cassandra™, today announced that DataStax Enterprise, the industry’s first distributed, scalable, and highly available database platform powered by Apache Cassandra™ 1.0, is now available.

“The ability to manage both real-time and analytic data in a simple, massively scalable, integrated solution is at the heart of challenges faced by most businesses with legacy databases,” said Billy Bosworth, CEO, DataStax. “Our goal is to ensure businesses can conquer these challenges with a modern application solution that provides operational simplicity, optimal performance and incredible cost savings.”

“Apache Cassandra is the scalable, high-impact, comprehensive data platform that is well-suited to the rapidly-growing real-time data needs of our social media platform,” said Christian Carollo, Senior Manager, Mobile for GameFly. “We leveraged the expertise of DataStax to deploy our new social media platform, and were able to complete the project without worrying about scale or distribution – we simply built a great application and Apache Cassandra took care of the rest.”

BTW, DataStax just added its 100th customer. You might recognize some of them: Netflix, Cisco, etc.

Hadoop for Data Analytics: Implementing a Weblog Parser

Filed under: Hadoop — Patrick Durusau @ 7:38 pm

Hadoop for Data Analytics: Implementing a Weblog Parser by Ira Agrawal.

From the post:

With the digitalization of the world, the data analytics function of extracting information or generating knowledge from raw data is becoming increasingly important. Parsing Weblogs to retrieve important information for analysis is one of the applications of data analytics. Many companies have turned to this application of data analytics for their basic needs.

For example, Walmart would want to analyze the bestselling product category for a region so that they could notify users living in that region about the latest products under that category. Another use case could be to capture the area details — using IP address information — about the regions that produce the most visits to their site.

All user transactions and on-site actions are normally captured in weblogs on a company’s websites. To retrieve all this information, developers must parse these weblogs, which are huge. While sequential parsing would be very slow and time consuming, parallelizing the parsing process makes it fast and efficient. But the process of parallelized parsing requires developers to split the weblogs into smaller chunks of data, and the partition of the data should be done in such a way that the final results will be consolidated without losing any vital information from the original data.

Hadoop’s MapReduce framework is a natural choice for parallel processing. Through Hadoop’s MapReduce utility, the weblog files can be split into smaller chunks and distributed across different nodes/systems over the cluster to produce their respective results. These results are then consolidated and the final results are obtained as per the user’s requirements.

The post walks you through the whole process, from setting up the Hadoop cluster to loading the logs and then parsing them. Not a bad introduction to Hadoop.
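As a rough sketch of the parallel parsing idea (mine, not the article’s code), here is a minimal MapReduce job that extracts the client IP from each log line and counts hits per IP. The assumption that the IP is the first whitespace-delimited field matches the common/combined log formats; adjust for your own logs.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Counts hits per client IP from weblogs in common/combined log format. */
public class WeblogHitCount {

    public static class ParseMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text ip = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // In common/combined log format the client IP is the first field.
            String[] fields = value.toString().split("\\s+");
            if (fields.length > 0 && !fields[0].isEmpty()) {
                ip.set(fields[0]);
                context.write(ip, ONE);
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "weblog hit count");
        job.setJarByClass(WeblogHitCount.class);
        job.setMapperClass(ParseMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The same skeleton extends naturally to richer parsing (product categories, geo lookups on the IP, and so on) by changing what the mapper emits.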

29th International Conference on Machine Learning (ICML-2012)

Filed under: Conferences,Machine Learning — Patrick Durusau @ 7:38 pm

29th International Conference on Machine Learning (ICML-2012), June 26 to July 1, 2012

Dates:

  • Workshop and tutorial proposals due February 10, 2012
  • Paper submissions due February 24, 2012
  • Author response period April 9–12, 2012
  • Author notification April 30, 2012
  • Workshop submissions due May 7, 2012
  • Workshop author notification May 21, 2012
  • Tutorials June 26, 2012
  • Main conference June 27–29, 2012
  • Workshops June 30–July 1, 2012

From the call for papers:

The 29th International Conference on Machine Learning (ICML 2012) will be held at the University of Edinburgh, Scotland, from June 26 to July 1 2012.

ICML 2012 invites the submission of engaging papers on substantial, original, and previously unpublished research in all aspects of machine learning. We welcome submissions of innovative work on systems that are self adaptive, systems that improve their own performance, or systems that apply logical, statistical, probabilistic or other formalisms to the analysis of data, to the learning of predictive models, to cognition, or to interaction with the environment. We welcome innovative applications, theoretical contributions, carefully evaluated empirical studies, and we particularly welcome work that combines all of these elements. We also encourage submissions that bridge the gap between machine learning and other fields of research.

FOSDEM

Filed under: Conferences,Graphs — Patrick Durusau @ 7:38 pm

FOSDEM: Free and Open Source Software Developers’ European Meeting

FOSDEM is probably the largest free and non-commercial open source event, taking place in Brussels, Belgium on 4 and 5 February 2012. Being a developer-oriented conference, it is the open source communities and developers that make it what it is. (emphasis added)

The first round call for participation is still open (as of 11/11/11).

A second round call for short talks is forthcoming.

Developer rooms for FOSDEM 2012 include one on graph processing.


IT’s Next Hot Job: Hadoop Guru

Filed under: Hadoop,Jobs — Patrick Durusau @ 7:37 pm

IT’s Next Hot Job: Hadoop Guru by Doug Henschen InformationWeek.

“We’re hiring, and we’re paying 10% more than the other guys.”

Those were the first words from Larry Feinsmith, managing director, office of the CIO, at JPMorgan Chase, in his Tuesday keynote address at Hadoop World in New York. Who JPMorgan Chase is hiring, specifically, are people with Hadoop skills, so Feinsmith was in the right place. More than 1,400 people were in the audience, and attendee polls indicated that at least three quarters of their organizations are already using Hadoop, the open source big data platform.

The “and we’re paying 10% more” bit was actually Feinsmith’s ad-libbed follow-on to the previous keynoter, Hugh Williams, VP of search, experience, and platforms at eBay. After explaining eBay’s Hadoop-based Cassini search engine project, Williams said his company is hiring Hadoop experts to help build out and run the tool.

Feinsmith’s core message was that Hadoop is hugely promising, maturing quickly, and might overlap the functionality of relational databases over the next three years. In fact, Hadoop World 2011 was a coming-out party of sorts, as it’s now clear that Hadoop will matter to more than just Web 2.0 companies like eBay, Facebook, Yahoo, AOL, and Twitter. A straight-laced financial giant with more than 245,000 employees, 24 million checking accounts, 5,500 branches, and 145 million credit cards in use, JPMorgan Chase lends huge credibility to that vision.

JP Morgan Chase has 25,000 IT employees, and it spends about $8 billion on IT each year–$4 billion on apps and $4 billion on infrastructure. The company has been working with Hadoop for more than three years, and it’s easy to see why. It has 150 petabytes (with a “p”) of data online, generated by trading operations, banking activities, credit card transactions, and some 3.5 billion logins each year to online banking and brokerage accounts.

The benefits of Hadoop? Massive scalability, schema-free flexibility to handle a variety of data types, and low cost. Hadoop systems built on commodity hardware now cost about $4,000 per node, according to Cloudera, the Hadoop enterprise support and management software provider (and the organizer and host of Hadoop World). With the latest nodes typically having 16 compute cores and 12 1-terabyte or 2-terabyte drives, that’s massive storage and compute capacity at a very low cost. In comparison, aggressively priced relational data warehouse appliances cost about $10,000 to $12,000 per terabyte.

OK, but what does Hadoop not have out of the box? Can you say cross-domain subject or data semantics? Some expert (insert your name here) is going to have to supply the semantics. You have to know the Hadoop ecosystem, but a firm background in mapping between semantic domains will make you a semantic “top gun.”

Clojure on Hadoop: A New Hope

Filed under: Cascalog,Clojure,Hadoop — Patrick Durusau @ 1:30 pm

Clojure on Hadoop: A New Hope by Chun Kuk.

From the post:

Factual’s U.S. Places dataset is built from tens of billions of signals. Our raw data is stored in HDFS and processed using Hadoop.

We’re big fans of the core Hadoop stack, however there is a dark side to using Hadoop. The traditional approach to building and running Hadoop jobs can be cumbersome. As our Director of Engineering once said, “there’s no such thing as an ad-hoc Hadoop job written in Java”.

Factual is a Clojure friendly shop, and the Clojure community led us to Cascalog. We were intrigued by its strength as an agile query language and data processing framework. It was easy to get started, which is a testament to Cascalog’s creator, Nathan Marz.

We were able to leverage Cascalog’s high-level features such as built-in joins and aggregators to abstract away the complexity of commonly performed queries and QA operations.

This article aims to illustrate Cascalog basics and core strengths. We’ll focus on how easy it is to run useful queries against data stored with different text formats such as csv, json, and even raw text.

Somehow, after that lead in, I was disappointed by what followed.

I am curious what others think. As far as it goes, this is a good article on Clojure, but it doesn’t really reach the “core strengths” of Cascalog, does it?

November 10, 2011

Importing data from another Solr

Filed under: Lucene,Solr — Patrick Durusau @ 6:48 pm

Importing data from another Solr

Luca Cavanna writes:

The Data Import Handler is a popular method to import data into a Solr instance. It provides out of the box integration with databases, xml sources, e-mails and documents. A Solr instance often has multiple sources and the process to import data is usually expensive in terms of time and resources. Meanwhile, if you make some schema changes you will probably find you need to reindex all your data; the same happens with indexes when you want to upgrade to a Solr version without backward compatibility. We can call it “re-index bottleneck”: once you’ve done the first data import involving all your external sources, you will never want to do it the same way again, especially on large indexes and complex systems.

Retrieving stored fields from a running Solr

An easier solution to do this is based on querying your existing Solr whereby it retrieves all its stored fields and reindexes them on a new instance. Everyone can write their own script to achieve this, but wouldn’t it be useful having a functionality like this out of the box inside Solr? This is the reason why the SOLR-1499 issue was created about two years ago. The idea was to have a new EntityProcessor which retrieves data from another Solr instance using Solrj. Recently effort has been put into getting this feature committed to Solr’s dataimport contrib module. Bugs have been fixed and test coverage has been increased. Hopefully this issue will get released with Solr 3.5.

A look ahead to the next release of Solr!
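Until SolrEntityProcessor ships, the “write your own script” route Luca mentions can be as simple as paging through one instance’s stored fields with SolrJ and re-adding them to another. Here is a hedged sketch against the SolrJ 3.x-era API; the host names and page size are placeholders, and it only works for fields that are stored.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.common.SolrInputDocument;

/** Pages stored fields out of one Solr instance and reindexes them in another. */
public class SolrToSolrReindex {
    public static void main(String[] args) throws Exception {
        SolrServer source = new CommonsHttpSolrServer("http://old-host:8983/solr");
        SolrServer target = new CommonsHttpSolrServer("http://new-host:8983/solr");

        final int rows = 500;          // page size, tune to taste
        long start = 0;
        long numFound = Long.MAX_VALUE;

        while (start < numFound) {
            SolrQuery query = new SolrQuery("*:*");
            query.setStart((int) start);
            query.setRows(rows);
            QueryResponse response = source.query(query);
            SolrDocumentList page = response.getResults();
            numFound = page.getNumFound();

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (SolrDocument doc : page) {
                SolrInputDocument copy = new SolrInputDocument();
                for (String field : doc.getFieldNames()) {
                    // Copies whatever was stored; copyField targets may need filtering.
                    copy.addField(field, doc.getFieldValue(field));
                }
                batch.add(copy);
            }
            if (!batch.isEmpty()) {
                target.add(batch);
            }
            start += rows;
        }
        target.commit();
    }
}
```

The attraction of SOLR-1499 is that it folds exactly this loop into the Data Import Handler so you configure it instead of writing it.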

Google1000 dataset

Filed under: Dataset,Image Recognition,Machine Learning — Patrick Durusau @ 6:46 pm

Google1000 dataset

From the post:

This is a dataset of scans of 1000 public domain books that was released to the public at ICDAR 2007. At the time there was no public serving infrastructure, so few people actually got the 120GB dataset. It has since been hosted on Google Cloud Storage and made available for public download: (see the post for the links)

It is intended for OCR and machine learning purposes, the results of which you may wish to unite in topic maps with other resources.

Putting Data in the Middle

Filed under: Data,Interoperability — Patrick Durusau @ 6:45 pm

Putting Data in the Middle

Jill Dyche uses a photo of Paul Allen and Bill Gates as a jumping-off point to talk about a data-centric view of the world.

Remarking:

IT departments furtively investing in successive integration efforts, hoping for the latest and greatest “single version of the truth” watch their budgets erode and their stakeholders flee. CIOs praying that their latest packaged application gets traction realize that they’ve just installed yet another legacy system. Executives wake up and admit that the idea of a huge, centralized, behemoth database accessible by all and serving a range of business needs was simply a dream. Rubbing their eyes they gradually see that data is decoupled from the systems that generate and use it, and past infrastructure plays have merely sedated them.

I really like the “successive integration efforts” line.

Jill offers an alternative to that sad scenario, but you will have to read her post to find out!

HowTo.gov

Filed under: Marketing — Patrick Durusau @ 6:43 pm

HowTo.gov

A window into best information practices for U.S. government agencies. Shape your sales pitch to match those practices.

The General Services Administration (GSA) (well actually, GSA’s Office of Citizen Services & Innovative Technologies) sponsors the site.

I haven’t encountered anything earth-shaking or new, but it is an attractive site designed to assist agency staff with questions about how to better deliver information from or about their agencies.

Definitely a site I would pass along to state and local as well as federal agencies. They will benefit from the information it contains and it will give you a jumping-off point for discussion of how you can assist with their information needs.

Take the struggle out of search

Filed under: Findability,Search Interface,Searching — Patrick Durusau @ 6:42 pm

Take the struggle out of search

John Moore writes in Federal Computer Week:

Consistency is generally a good thing, but the Food Safety and Inspection Service’s website established a pattern for its search function no organization wants to own: It was consistently bad.

The agency used a combination of Web analytics and more detailed survey questions to zero in on the problem and discovered what was frustrating some site visitors: They were searching for information that couldn’t be found on the site. FSIS’ food safety purview covers meat, poultry and eggs, but some users were searching for information on vegetable and seafood recalls. Those alerts fall under the Food and Drug Administration.

The problem was solved here by directing visitors off-site to find the appropriate information.

Question: What do you do when visitors ask for information you don’t have?

Say they search for a game title that is by a competing manufacturer? Or a book title from another publisher? Or some other product by “the competition.”

Do you simply return a null result? (Tip for the day: when all else fails, return a useful result.)

The article provides the details of, and possible solutions to, the following problems (not unique to government; a 2008 survey says half of all commercial businesses lack findability goals, and I would be willing to bet that hasn’t improved):

  • Problem 1: Poor information architecture
  • Problem 2: Not enough people or expertise
  • Problem 3: Too many government websites
  • Problem 4: Little or no SEO

Is this another data point in the continuing saga of why semantic solutions, including topic maps, face slow uptake?

That most organizations, whether commercial, governmental, or non-profit, lack basic information storage/retrieval skills. They have some very highly skilled people, but not enough to do everything. Most of the rest are very willing but lack the skills to make a difference.

Which makes offering them advanced information technologies like offering a grade school science fair participant the use of the Large Hadron Collider in place of their lost radium sample for a Wilson cloud chamber. It may someday be useful to them, but not today.

Suggestion: Use advanced techniques (I would argue for topic maps) to create “better” search capabilities for part of an agency website. You can’t really repair poor architecture remotely, but you probably can minimize its impact. Create a noticeably more useful search experience, such that even agency staff turn to it for some resources. That gives you a calling card with validation to back it up. (You probably also need to hire that recently retired section chief, but doing a good job helps as well.)

PS: Just so you know, the first example of antimatter, a positive electron, was discovered with a cloud chamber: a stray cosmic ray with enough power for the decay pattern to include a positron. Cloud chamber plans are the start of an education that will let you talk to the folks at CERN in their own terms.

Graph Theory in Sage

Filed under: Graphs,Mathematics,Sage — Patrick Durusau @ 6:40 pm

Graph Theory in Sage is a presentation by William Stein of some of the graph capabilities of Sage.

I mention it because there has been discussion on the Neo4j mailing list about learning graph theory and this may be helpful in that regard.

There is a Sage worksheet that has all the formulas and values used in the presentation.

You can also download the video.

You will have to experience it for yourself but I thought the help feature on graphs was most impressive.

Sage will help you get your feet on the ground with formal graph theory.

Sage

Filed under: Mathematica,Mathematics,Sage — Patrick Durusau @ 6:38 pm

Sage

Kirk Lowery mentioned Sage to me and, with mathematics being fundamental to IR, it seemed like a good resource to mention. Use it for research, for working through one of the course books, or for satisfying yourself that algorithms operate as advertised.

You don’t have to take someone’s word on algorithms. Use a small enough test case that you will recognize the effects of the algorithm. Or test it against another algorithm said to give similar results.

I saw a sad presentation years ago when a result was described as significant because the manual for the statistics package used said it was significant. Don’t let that be you, either in front of a client or in a presentation to peers.

From the website:

Sage is a free open-source mathematics software system licensed under the GPL. It combines the power of many existing open-source packages into a common Python-based interface.

Mission: Creating a viable free open source alternative to Magma, Maple, Mathematica and Matlab.

From the feature tour:

Sage is built out of nearly 100 open-source packages and features a unified interface. Sage can be used to study elementary and advanced, pure and applied mathematics. This includes a huge range of mathematics, including basic algebra, calculus, elementary to very advanced number theory, cryptography, numerical computation, commutative algebra, group theory, combinatorics, graph theory, exact linear algebra and much more. It combines various software packages and seamlessly integrates their functionality into a common experience. It is well-suited for education and research.

The user interface is a notebook in a web browser or the command line. Using the notebook, Sage connects either locally to your own Sage installation or to a Sage server on the network. Inside the Sage notebook you can create embedded graphics, beautifully typeset mathematical expressions, add and delete input, and share your work across the network.

The following showcase presents some of Sage’s capabilities, screenshots and gives you an overall impression of what Sage is. The examples show the lines of code in Sage on the left side, accompanied by an explanation on the right. They only show the very basic concepts of how Sage works. Please refer to the documentation material for more detailed explanations or visit the library to see Sage in action.

In all fairness to Mathematica, the hobbyist version is only $295 for Mathematica 8, with versions for Windows (XP/Vista/7), Mac OS X (Intel) and Linux. There is a reason why people want to be like…some other software. Mathematica has data mining capabilities and a host of other features. I am contemplating a copy of Mathematica as a Christmas present for myself.

Do note that all of the Fortune 50 companies use Mathematica. The hobbyist version allows you to add an important skill set that is relevant to a select clientele. Not to mention various government agencies, etc.

Should a job come along that requires it, I can simply upgrade to a professional license. Why? Well, I expect people to pay my invoices when I submit them. Why shouldn’t I pay for software I use on the jobs that result in those invoices?

Don’t cut corners on software. Same goes for the quality of jobs. It will show. If you don’t know, don’t lie, say you don’t know but will find out. Clients will find simple honesty quite refreshing. (I can’t promise that result for you but it has been the result for me over a variety of professions.)

Indexing Sound: Musical Riffs to Gunshots

Filed under: Indexing,Similarity,Sound — Patrick Durusau @ 6:37 pm

Sound, Digested: New Software Tool Provides Unprecedented Searches of Sound, from Musical Riffs to Gunshots

From the post:

Audio engineers have developed a novel artificial intelligence system for understanding and indexing sound, a unique tool for both finding and matching previously un-labeled audio files.

Having concluded beta testing with one of the world’s largest Hollywood sound studios and leading media streaming and hosting services, Imagine Research of San Francisco, Calif., is now releasing MediaMined™ (http://www.imagine-research.com/node/51) for applications ranging from music composition to healthcare.

….

One of the key innovations of the new technology is the ability to perform sound-similarity searches. Now, when a musician wants a track with a matching feel to mix into a song, or an audio engineer wants a slightly different sound effect to work into a film, the process can be as simple as uploading an example file and browsing the detected matches.

“There are many tools to analyze and index sound, but the novel, machine-learning approach of MediaMined™ was one reason we felt the technology could prove important,” says Errol Arkilic, the NSF program director who helped oversee the Imagine Research grants. “The software enables users to go beyond finding unique objects, allowing similarity searches–free of the burden of keywords–that generate previously hidden connections and potentially present entirely new applications.”

Or from the Imagine Research Applications page:

Organize Sound

Automatically index the acoustic content of video, audio, and live streams across a company’s web services. Analyze web-crawled data, user-generated content, professional broadcast content, and live streaming events.

Benefits:

  • Millions of minutes of content are now searchable
  • Recommending related content increases audience and viewer consumption
  • Better content discovery, intelligent navigation within media files
  • Search audio content with ease and accuracy
  • Audio content-aware targeted ads – improves ad performance and revenue

Search Sound

  • Perform sound-similarity searches for sounds and music by using example sounds
  • Search for production music that matches a given track
  • Perform rhythmic similarity searches

Benefits:

  • Recommending related content increases audience and viewer consumption
  • Music/Audio licensing portals provide a unique-selling point: find content based on an input seed track.
  • Improved monetization of existing content with similarity-search and recommendations

And if you have a topic map of music, producers, studios, albums, etc., this could supplement your topic map based on similarity measures every time a new recording is released, or music is uploaded to a website or posted to YouTube. So you know who to contact for whatever reason.

A topic map of videos could serve up matches and links with thumbnails for videos with similar sound content based on a submitted sample, a variation as it were on “more of same.”

A topic map of live news feeds could detect repetition of news stories and with timing information could map the copying of content from one network to the next. Or provide indexing of news accounts without the necessity of actually sitting through the broadcasts. That is an advantage not mentioned above.

Sound recognition isn’t my area so if this is old news or there are alternatives to suggest to topic mappers, please sing out! (Sorry!)

NIST Smart Grid roadmap calls for common data semantics

Filed under: Semantics — Patrick Durusau @ 6:36 pm

NIST Smart Grid roadmap calls for common data semantics

The news account reads in part:

Smart Grid implementation requires a common semantical understanding of data elements, says the National Institute of Standards and Technology in a draft version 2.0 of its framework and roadmap for Smart Grid interoperability standards.

NIST posted the document, dated Oct. 17, online on Oct. 25. It proposes a conceptual model of the Smart Grid as defined by electrical flows and secure communications running between seven main domains: bulk generation, transmission, distribution, markets, operations, service providers and customers.
….
A Smart Grid truly operating as envisioned–as an electrical grid system whose management and use is driven by data produced by all domains–is heavily dependent on the consistency of semantic models, the draft says.

This article is a good reason for not jumping to conclusions based on news reports. The draft V.2 NIST framework and roadmap for Smart Grid interoperability standards (linked to from the article) makes it clear that while common data semantics may be necessary for a “Smart Grid” implementation, NIST isn’t insane enough to think that is happening. At least any time soon.

From page 8 of the draft:

NIST supported the Commission’s order, which notes that “In its comments, NIST suggests that the Commission could send appropriate signals to the marketplace by recommending use of the NIST Framework without mandating compliance with particular standards. NIST adds that it would be impractical and unnecessary for the Commission to adopt individual interoperability standards.”

Although the NIST framework and roadmap effort is the product of federal legislation, broad engagement of Smart Grid stakeholders at the state and local levels is essential to ensure the consistent voluntary application of the standards being developed. Currently, many states and their utility commissions are pursuing Smart Grid-related projects. Ultimately, state and local projects will converge into fully functioning elements of the Smart Grid “system of systems.” Therefore, the interoperability and cybersecurity standards developed under the NIST framework and roadmap must support the role of the states in modernizing the nation’s electric grid. The NIST framework can provide a valuable input to regulators as they consider the prudency of investments proposed by utilities.

We are going to “suggest” to people that they adopt a common data semantic?

Just a sip from one indication of the complexity of the task ahead:

Establish a Smart Grid Interoperability Panel forum to drive longer-term progress. A representative, reliable, and responsive organizational forum is needed to sustain continued development of the framework of interoperability standards. On November 19, 2009, a Smart Grid Interoperability Panel (SGIP) was launched to serve this function and has now grown to over 675 organizations comprising over 1790 members. (NIST objective, page 15)

The objective of this NIST effort isn’t to create a common data semantic, which most data people in the trenches would acknowledge isn’t possible, but rather:

A key objective of the NIST work is to create a self-sustaining, ongoing standards process that supports continuous innovation as grid modernization continues in the decades to come.

So, no common data semantic. A process to talk about what it might be like if a common data semantic existed, assuming it was scoped to even be a sensible thing to talk about.

My heart rate, breathing have returned to normal. How about yours?

Machine Learning (Carnegie Mellon University)

Filed under: Computer Science,CS Lectures,Machine Learning — Patrick Durusau @ 6:33 pm

Machine Learning 10-701/15-781, Spring 2011 Carnegie Mellon University by Tom Mitchell.

Course Description:

Machine Learning is concerned with computer programs that automatically improve their performance through experience (e.g., programs that learn to recognize human faces, recommend music and movies, and drive autonomous robots). This course covers the theory and practical algorithms for machine learning from a variety of perspectives. We cover topics such as Bayesian networks, decision tree learning, Support Vector Machines, statistical learning methods, unsupervised learning and reinforcement learning. The course covers theoretical concepts such as inductive bias, the PAC learning framework, Bayesian learning methods, margin-based learning, and Occam’s Razor. Short programming assignments include hands-on experiments with various learning algorithms, and a larger course project gives students a chance to dig into an area of their choice. This course is designed to give a graduate-level student a thorough grounding in the methodologies, technologies, mathematics and algorithms currently needed by people who do research in machine learning.

I don’t know how other disciplines are faring, but for a variety of CS topics there are enough excellent online materials to complete the equivalent of an undergraduate, if not a master’s, degree in CS.

November 9, 2011

How Common Is Merging?

Filed under: Dataset,Merging,Topic Map Software,Topic Maps — Patrick Durusau @ 7:44 pm

I started wondering about how common merging is in topic maps because I noticed a gap I have not seen addressed before: there aren’t any large test collections of topic maps for CS types to break their clusters against. The sort of thing that challenges their algorithms and hardware.

But test collections should have some resemblance to actual data sets, at least if that is known with any degree of certainty. Or at least be one of the available data sets.

As a first step towards exploring this issue, I grepped for topic elements in the Opera topic map and the CIA Fact Book topic map and got:

  • Opera topic map: 29,738
  • CIA Fact Book: 111,154

for a total of 140,892 topic elements. After merging the two maps, there were 126,204 topic elements. So I count that as merging 14,688 topic elements.

Approximately 10% of the topics in the two sets.

A very crude way to go about this but I was looking for rough numbers that may provoke some discussion and more refined measurements.

I mention that because one thought I had was to simply “cat” the various topic maps at topicmapslab.de (in CTM format) together into one file and then to “cat” that file until I have 1 million, 10 million and 100 million topic sets (approximately). Just a starter set to see what works/doesn’t work before scaling up the data sets.

Creating the files in this manner is going to result in a “merge heavy” topic map due to the duplication of content. That may not be a serious issue and perhaps better that it be that way in order to stress algorithms, etc. It would have the advantage that we could merge the original set and then project the number of merges that should be found in the various sets.

Suggestions/comments?
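For anyone who wants to reproduce the crude count, here is a minimal sketch along the lines of what I did, counting &lt;topic&gt; start tags before and after a merge. The file names are placeholders and the regex is deliberately rough; it is a stand-in for grep, not an XTM parser.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Crude count of <topic> start tags in XTM files, as a grep substitute. */
public class TopicCounter {

    // Matches <topic ...> or <topic> start tags; good enough for a rough count.
    private static final Pattern TOPIC_TAG = Pattern.compile("<topic[\\s>]");

    static long countTopics(String file) throws IOException {
        long count = 0;
        BufferedReader reader = new BufferedReader(new FileReader(file));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher m = TOPIC_TAG.matcher(line);
                while (m.find()) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        long opera = countTopics("opera.xtm");        // placeholder file names
        long cia = countTopics("cia-factbook.xtm");
        long merged = countTopics("merged.xtm");      // output of a merge run

        long before = opera + cia;
        System.out.println("Topics before merge: " + before);
        System.out.println("Topics after merge:  " + merged);
        System.out.println("Topics merged away:  " + (before - merged));
        System.out.printf("Roughly %.1f%% of the input topics%n",
                100.0 * (before - merged) / before);
    }
}
```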

Multiperspective

Filed under: Associations,Data Mining,Graphs,Multiperspective,Neo4j,Visualization — Patrick Durusau @ 7:43 pm

Multiperspective

From the readme:

WHAT IS MULTISPECTIVE?

Multispective is an open source intelligence management system based on the neo4j graph database. By using a graph database to capture information, we can use its immensely flexible structure to store a rich relationship model and easily visualize the contents of the system as nodes with relationships to one another.

INTELLIGENCE MANAGEMENT FOR ACTIVISTS AND COLLECTIVES

The main purpose for creating this system is to provide socially motivated groups with an open source software product for managing their own intelligence relating to target networks, such as corporations, governments and other organizations. Multispective will provide these groups with a collective/social mechanism for obtaining and sharing insights into their target networks. My intention is that Multispective’s use of social media paradigms combined with visualisations will provide a well-articulated user interface into working with complex network data.

Inspired by the types of intelligence management systems used by law enforcement and national security agencies, Multispective will be great for showing things like corporate ownership and interest, events like purchases, payments (bribes), property transfers and criminal acts. The system will make it easier to look at how seemingly unrelated information is actually connected.

Multispective will also allow groups to overlap in areas of interest, discovering commonalities between discrete datasets, and being able to make use of data which has already been collected. (emphasis added)

The last two lines would not be out of place in any topic map presentation.

A project that is going to run into subject identity issues sooner rather than later. Experience and suggestions from the topic map camp would be welcome, I suspect.

I don’t have a lot of extra time but I am going to toss my hat into the ring as at least interested in helping. How about you?
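To make the “nodes with relationships” idea concrete, here is a minimal sketch against the embedded Neo4j Java API of this era. It is not Multispective code; the node properties, the PAID relationship and the data are invented for the example.

```java
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;
import org.neo4j.kernel.EmbeddedGraphDatabase;

/** Minimal "intelligence graph" sketch: a company, an official, a payment. */
public class IntelligenceGraphSketch {
    public static void main(String[] args) {
        GraphDatabaseService db = new EmbeddedGraphDatabase("target/intel-db");
        Transaction tx = db.beginTx();
        try {
            Node company = db.createNode();
            company.setProperty("name", "Acme Holdings");   // invented example data

            Node official = db.createNode();
            official.setProperty("name", "J. Doe");

            // A relationship carries its own properties, e.g. a payment amount.
            Relationship payment = company.createRelationshipTo(
                    official, DynamicRelationshipType.withName("PAID"));
            payment.setProperty("amount", 50000);
            payment.setProperty("date", "2011-11-09");

            tx.success();
        } finally {
            tx.finish();
        }
        db.shutdown();
    }
}
```

The interesting (and hard) part, of course, is deciding when the “Acme Holdings” in one group’s data is the same subject as the “Acme Holdings Ltd.” in another’s, which is exactly where topic map experience comes in.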

B2B Blog Strategy | Ten Be’s of The Best B2B Blogs

Filed under: Business Intelligence,Marketing,Topic Maps — Patrick Durusau @ 7:43 pm

B2B Blog Strategy | Ten Be’s of The Best B2B Blogs

Joel York writes:

Blogging is one of the easiest, cheapest and most effective ways to engage the New Breed of B2B Buyer, yet so many B2B blogs miss the mark. Here are ten “be’s” of the best b2b blogs. It isn’t the first top ten list of best B2B blog secrets, and no doubt it will not be the last. But, it is mine and it’s what I personally strive for Chaotic Flow to be.

Joel’s advice will work for topic map blogs as well.

People are not going to find out about topic maps unless we continue to push information about topic maps out into the infosphere. Blogging is one aspect of pushing information. Tweeting is another. Publication of white papers, software and other materials is another.

The need for auditable, repeatable, reliable consolidation (if you don’t like the merging word) of information from different sources is only growing with the availability of more data on the Internet. I think topic maps has a role to play there. Do you?

Connecting the Dots: An Introduction

Filed under: Business Intelligence,Data Management,Data Models — Patrick Durusau @ 7:43 pm

Connecting the Dots: An Introduction

A new series of posts by Rick Sherman who writes:

In the real world the situations I discuss or encounter in enterprise BI, data warehousing and MDM implementations lead me to the conclusion that many enterprises simply do not connect the dots. These implementations potentially involve various disciplines such as data modeling, business and data requirements gathering, data profiling, data integration, data architecture, technical architecture, BI design, data governance, master data management (MDM) and predictive analytics. Although many BI project teams have experience in each of these disciplines they’re not applying the knowledge from one discipline to another.

The result is knowledge silos, where the best practices and experience from one discipline are not applied in the other disciplines.

The impact is a loss in productivity for all, higher long-term costs and poorly constructed solutions. This often results in solutions that are difficult to change as the business changes, don’t scale as the data volumes or numbers of uses increase, or is costly to maintain and operate.

Imagine that, knowledge silos in the practice of eliminating knowledge silos.

I suspect that reflects the reality that each of us is a model of a knowledge silo. There are areas we like better than others, areas we know better than others, areas where we simply don’t have the time to learn. But when asked for an answer to our part of a project, we have to have some answer, so we give the one we know. Hard to imagine us doing otherwise.

We can try to offset that natural tendency by reading broadly, looking for new areas or opportunities to learn new techniques, or at least have team members or consultants who make a practice out of surveying knowledge techniques broadly.

Rick promises to show how data modeling is disconnected from the other BI disciplines in the next Connecting the Dots post.

Tiny Trilogy

Filed under: Business Intelligence,Data Warehouse — Patrick Durusau @ 7:43 pm

Tiny Trilogy

Peter Thomas writes:

Although tinyurl.com was a pioneer in URL shortening, it seems to have been overtaken by a host of competing services. For example I tend to use bit.ly most of the time. However I still rather like the tinyurl.com option to create your own bespoke shortened URLs.

This feature rather came into its own recently when I was looking for a concise way to share my recent trilogy focusing on the use of historical data to justify BI/DW investments in Insurance.

Good series of posts on historical data and business intelligence. I suspect many of these lessons could be applied fairly directly to using historical data to justify semantic integration projects.

Such as showing what sharing information on terrorists prior to 9/11 would have meant.

Microsoft Business Intelligence (BI) Resources

Filed under: Business Intelligence,Microsoft — Patrick Durusau @ 7:42 pm

Microsoft Business Intelligence (BI) Resources

Dan English posts a number of MS BI resources from a PowerView session.

I don’t have access to an MS Server environment so you will have to evaluate these resources on your own.

For historical reasons I have mostly worked in *nix server environments. I have never really been tempted to experiment with MS server products, although I must confess I have had my share of laptops/desktops that ran Windows software. (I have *nix and Windows boxes sharing monitors/keyboard even now.)

With hardware prices where they are, perhaps I should setup a Windows server box (behind my firewall, etc.) so I can test some of these applications.

I am wondering what it would take to put subject identity tests, a semantic shim as it were, on top of these products to offer some enhanced value to their users. True enough, if it proved popular MS would absorb it in a future release, but isn’t that what progress is about?

Accel Partners announces $100M Big Data Fund…

Filed under: Conferences,Funding — Patrick Durusau @ 7:42 pm

Accel Partners announces $100M Big Data Fund — to invest in Hadoop, NoSQL and other cool stuff

From the post:

Venture firm Accel Partners has carved out a $100 million “big data” fund to invest in companies focused on building new IT infrastructure or on applications that run on that new infrastructure.

Accel, based in Palo Alto, Calif., at the heart of Silicon Valley’s venture capital community, has invested in companies like Facebook, Dropbox, Cloudera and Etsy.

As such, the firm has seen how companies like Facebook have been forced to exploit new technologies to store and analyze their huge amounts of data more efficiently. In Facebook’s case, it has used open source project Hadoop to help it process the billions of messages it receives each day. NoSQL database technology is another way companies have become more efficient in storing data.

All big Web companies, including Google, Yahoo and Twitter, and increasingly large enterprise companies, are building applications on these platforms.

Ping Li, the partner at Accel (pictured right) who has led the firm’s investments in companies such as Cloudera — which commercialized the Hadoop technology — said the new fund will be invested in two types of companies: (1) companies building out the new infrastructure, including in storage, security and management; and (2) companies building applications on top of that infrastructure (spanning, for example, business intelligence, collaboration, mobile and vertical apps).

He said these companies will span just about every sector, from enterprise to gaming — all of which will require new kinds of data-intensive platforms, he said. Investments will be made globally, he added.

Over the last 30 years, legacy data platforms, including relational databases, drove the emergence of significant companies like Oracle, SAP and Symantec, Li said. Likewise, big data will usher in a new era of multi-billion software companies, Li says.

The firm has carved out the $100 million from its existing funds, so this does not represent a fresh dollop of cash, Li said.

Accel also plans to host a “big data” conference in Spring, 2012, to drive discussion on technology trends in the sector, Li said.

You may not get part of this $100 million but attempting to do so will be good practice for next time.

I will keep watch for the Spring 2012 conference.

Redis: Zero to Master in 30 minutes – Part 1

Filed under: NoSQL,Redis — Patrick Durusau @ 7:41 pm

Redis: Zero to Master in 30 minutes – Part 1

From the post:

More than once, I’ve said that learning Redis is the most efficient way a programmer can spend 30 minutes. This is a testament to both how useful Redis is and how easy it is to learn. But, is it true, can you really learn, and even master, Redis in 30 minutes?

Let’s try it. In this part we’ll go over what Redis is. In the next, we’ll look at a simple example. Whatever time we have left will be for you to set up and play with Redis.

This is a nice post. It introduces enough of Redis for you to get some idea of its power without overwhelming you with details. It continues with Part 2, by the way.
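If you would rather poke at Redis from Java while you read along, here is a tiny sketch with the Jedis client covering a few of the basics: plain string values, a counter and a list. The keys and values are invented for the example.

```java
import java.util.List;

import redis.clients.jedis.Jedis;

/** A few Redis basics via the Jedis client: strings, a counter, a list. */
public class RedisBasics {
    public static void main(String[] args) {
        Jedis jedis = new Jedis("localhost", 6379);

        // Plain string key/value.
        jedis.set("user:1:name", "patrick");
        System.out.println(jedis.get("user:1:name"));

        // Atomic counter.
        jedis.incr("user:1:logins");
        jedis.incr("user:1:logins");
        System.out.println(jedis.get("user:1:logins"));   // "2"

        // A list used as a simple activity feed.
        jedis.lpush("user:1:feed", "read a post");
        jedis.lpush("user:1:feed", "wrote a post");
        List<String> recent = jedis.lrange("user:1:feed", 0, -1);
        System.out.println(recent);

        jedis.disconnect();
    }
}
```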

Apache Mahout: Scalable machine learning for everyone

Filed under: Amazon Web Services AWS,Mahout — Patrick Durusau @ 7:41 pm

Apache Mahout: Scalable machine learning for everyone by Grant Ingersoll.

Summary:

Apache Mahout committer Grant Ingersoll brings you up to speed on the current version of the Mahout machine-learning library and walks through an example of how to deploy and scale some of Mahout’s more popular algorithms.

A short summary of a twenty-three (23) page paper that concludes with two (2) pages of pointers to additional resources!

You will learn a lot about Mahout and Amazon Web Services (EC2).
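As one small taste of the Mahout APIs (not code from the article), here is a hedged sketch of the non-distributed recommender (“Taste”) side, working over a userID,itemID,preference CSV. The file name and user id are placeholders; the article itself goes much further, into clustering, classification and running the distributed algorithms on EC2.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

/** User-based collaborative filtering over a userID,itemID,preference CSV. */
public class QuickRecommender {
    public static void main(String[] args) throws Exception {
        // Each line of the file: userID,itemID,preference
        DataModel model = new FileDataModel(new File("preferences.csv"));

        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood =
                new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommendations for user 42.
        List<RecommendedItem> items = recommender.recommend(42, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}
```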
