Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

August 6, 2011

Real-time Streaming Analysis for Hadoop and Flume

Filed under: Flume,Hadoop,Interface Research/Design — Patrick Durusau @ 6:52 pm

Real-time Streaming Analysis for Hadoop and Flume

From the description:

This talk introduces an open-source SQL-based system for continuous or ad-hoc analysis of streaming data built on top of the Flume data collection platform for Hadoop.

Big data analytics based on Hadoop often require aggregating data in a large data store like HDFS or HBase, and then running periodic MapReduce processes over this data set. Getting “near real time” results requires running MapReduce jobs more frequently over smaller data sets, which has a practical frequency limit based on the size of the data and complexity of the analytics; the lower bound on analysis latency is on the order of minutes. This has spawned a trend of building custom analytics directly into the data ingestion pipeline, enabling some streaming operations such as early alerting, index generation, or real-time tuning of ad systems before performing less time-sensitive (but more comprehensive) analysis in MapReduce.

We present an open-source tool which extends the Flume data collection platform with a SQL-like language for analysis over streaming event-based data sets. We will discuss the motivation for the system, its architecture and interaction with Flume, potential applications, and examples of its usage.

Deeply awesome! Just wish I had been present to see the demo!

Makes me think of topic map creation from data streams, with the ability to test different subject identity merging conditions in real time. Rather than repetitive stories about a helicopter being downed, you get a summary report plus a listing, by location and time of publication, of the repetitive reports. Say one screen full of content, with access to the noise. Better use of your time?
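
To make the summarization idea concrete, here is a minimal, hypothetical Java sketch of that kind of in-stream operation: incoming reports are grouped by a subject key inside a window, and one summary line is emitted per subject when the window closes. The bare string key stands in for real subject identity merging conditions, which would be far richer.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch (not the tool from the talk): group streaming
// reports by a subject key inside a window and emit one summary line
// per subject instead of every repetitive story.
public class StreamSummarizer {
    private final Map<String, List<String>> window = new HashMap<String, List<String>>();

    // Called for each incoming report as it flows through the pipeline.
    public void onReport(String subject, String sourceAndTime) {
        List<String> hits = window.get(subject);
        if (hits == null) {
            hits = new ArrayList<String>();
            window.put(subject, hits);
        }
        hits.add(sourceAndTime);
    }

    // Called when the window closes: one line of signal, noise on demand.
    public void flush() {
        for (Map.Entry<String, List<String>> e : window.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue().size()
                    + " reports; details: " + e.getValue());
        }
        window.clear();
    }
}

Swap the key function and you are testing a different merging condition against the same stream.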

Machine learning problem settings

Filed under: Hadoop,Machine Learning,MapReduce — Patrick Durusau @ 6:51 pm

Machine learning problem settings

From the post:

After a few successful Apache Mahout projects the goal of this lecture was to introduce students to some of the basic concepts and problems encountered today in a world where huge datasets are generally available and are easy to process with Apache Hadoop. As such the course is targeted at an entry level audience – thorough treatment of the mathematical background of latest machine learning technology is left to the machine learning research groups in Potsdam, at TU Berlin and the neural information processing group at TU.

Slides and exercises that will be useful alongside, or as a warm-up for, Introduction to Artificial Intelligence – Stanford Class.

August 5, 2011

A Storm is coming: more details and plans for release

Filed under: NoSQL,Storm — Patrick Durusau @ 7:07 pm

A Storm is coming: more details and plans for release

Storm is going to be released at Strange Loop on September 19!

From the post:

Here’s a recap of the three broad use cases for Storm:

  1. Stream processing: Storm can be used to process a stream of new data and update databases in realtime. Unlike the standard approach of doing stream processing with a network of queues and workers, Storm is fault-tolerant and scalable.
  2. Continuous computation: Storm can do a continuous query and stream the results to clients in realtime. An example is streaming trending topics on Twitter into browsers. The browsers will have a realtime view on what the trending topics are as they happen.
  3. Distributed RPC: Storm can be used to parallelize an intense query on the fly. The idea is that your Storm topology is a distributed function that waits for invocation messages. When it receives an invocation, it computes the query and sends back the results. Examples of Distributed RPC are parallelizing search queries or doing set operations on large numbers of large sets.

The beauty of Storm is that it’s able to solve such a wide variety of use cases with just a simple set of primitives.
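
Use case 2 is easy to picture in miniature. Here is a hypothetical plain-Java sketch (not Storm’s API, which was not yet public when this was written): keep rolling counts per topic and let clients pull the current top-N as each update arrives. Storm’s contribution is running exactly this kind of computation fault-tolerantly across a cluster.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "continuous computation": rolling topic counts
// with the current top-N available to stream to clients at any moment.
public class TrendingTopics {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    // Called once per mention as it arrives on the stream.
    public void onMention(String topic) {
        Integer c = counts.get(topic);
        counts.put(topic, c == null ? 1 : c + 1);
    }

    // Current top-N topics, recomputable after every update.
    public List<String> topN(int n) {
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return Integer.compare(b.getValue(), a.getValue());
            }
        });
        List<String> top = new ArrayList<String>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}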

The really exciting part about all the current frenzy of development is imagining where it is going to be five (5) years from now.

Sentiment Analysis: Machines Are Like Us

Filed under: Analytics,Artificial Intelligence,Classifier,Machine Learning — Patrick Durusau @ 7:07 pm

Sentiment Analysis: Machines Are Like Us

An interesting post, in particular for this passage:

We are very aware of the importance of industry-specific language here at Brandwatch and we do our best to offer language analysis that specialises in industries as much as possible.

We constantly refine our language systems by adding newly trained classifiers (a classifier is the particular system used to detect and analyse the language of a query’s matches – which classifier should be used is determined upon query creation).

We have over 500 classifiers for different industries across the 17 languages we cover.

Did you catch that? Over 500 classifiers for different industries.

In other words, we don’t need a single classifier that does all the heavy lifting on entity recognition for building topic maps. We could, for example, train a classifier for use with all the journals in a field or sub-field. In astronomy, we don’t have to disambiguate all the various uses of “Venus” but can concentrate on the one most likely to be found in a sub-set of astronomy literature.

By using specialized classifiers, perhaps we can reduce the target for more generalized classifiers to a manageable size.
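
A sketch of what routing to specialized classifiers could look like, with hypothetical names throughout: pick the specialist when the query (or document stream) is created, as Brandwatch describes, and fall back to a general classifier only when no specialist exists.

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: route text to a domain-specific classifier,
// chosen once at query-creation time, as the post describes.
interface Classifier {
    String classify(String text); // e.g., a sentiment label or an entity sense
}

public class ClassifierRegistry {
    private final Map<String, Classifier> byDomain = new HashMap<String, Classifier>();

    public void register(String domain, Classifier c) {
        byDomain.put(domain, c);
    }

    // Pick the specialist (e.g., "astronomy"); fall back to a general model.
    public Classifier forDomain(String domain) {
        Classifier c = byDomain.get(domain);
        return c != null ? c : byDomain.get("general");
    }
}

The astronomy classifier then only has to tell the planet Venus from the occasional goddess, not from tennis players and razor blades as well.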

Topic “Flow” Map?

Filed under: Mapping,Maps,Visualization — Patrick Durusau @ 7:06 pm

The Economist’s Twitter followers click links, Al Jazeera’s retweet, study finds

This article uses “topic map” in the sense of:

Timing and topical interest matter when seeking attention. By arranging audience tweets into topic maps, we were able to visualise the flow of attention between topics of interest, across the different audiences.

The “topic map” shown is of the @AJENGLISH audience over one hour.

You can see the full article at: Engaging News Hungry Audiences Tweet by Tweet: An audience analysis of prominent mainstream media news accounts on Twitter

The full article includes another “topic map” of Fox.

As a publisher, I would be interested in what terminology I could use to reach other audiences, perhaps at other times. Doing that mapping of identifications would require a more traditional topic map.

If you were clever with it, that could result in real-time tracking of different memes for the same subjects across user streams such as Facebook and Twitter. From tracking it is just a short step to modeling and then influencing those memes.

Mahout: Hands on!

Filed under: Artificial Intelligence,Hadoop,Machine Learning,Mahout — Patrick Durusau @ 7:06 pm

Mahout: Hands on!

From the tutorial description at OSCON 2011:

Mahout is an open source machine learning library from Apache. At the present stage of development, it is evolving with a focus on collaborative filtering/recommendation engines, clustering, and classification.

There is no user interface, or a pre-packaged distributable server or installer. It is, at best, a framework of tools intended to be used and adapted by developers. The algorithms in this “suite” can be used in applications ranging from recommendation engines for movie websites to designing early warning systems in credit risk engines supporting the cards industry out there.

This tutorial aims at helping you set up Mahout to run on a Hadoop setup. The instructor will walk you through the basic idea behind each of the algorithms. Having done that, we’ll take a look at how it can be run on some of the large-sized datasets and how it can be used to solve real world problems.

If your site or smartphone app or viral facebook app collects data which you really want to use a lot more productively, this session is for you!

Not the only resource on Mahout you will want but an excellent place to start.
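
To give a flavor of the library before the tutorial, here is a minimal user-based recommender against Mahout’s Taste API (a sketch; the ratings.csv file of userID,itemID,preference triples is an assumed input):

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class QuickRecommender {
    public static void main(String[] args) throws Exception {
        // ratings.csv: userID,itemID,preference -- one rating per line (assumed input).
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood =
                new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Top 3 recommendations for user 1.
        for (RecommendedItem item : recommender.recommend(1, 3)) {
            System.out.println(item.getItemID() + " " + item.getValue());
        }
    }
}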

August 4, 2011

NCBI Handbook

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 6:20 pm

NCBI Handbook

From the website:

Bioinformatics consists of a computational approach to biomedical information management and analysis. It is being used increasingly as a component of research within both academic and industrial settings and is becoming integrated into both undergraduate and postgraduate curricula. The new generation of biology graduates is emerging with experience in using bioinformatics resources and, in some cases, programming skills.

The National Center for Biotechnology Information (NCBI) is one of the world’s premier Web sites for biomedical and bioinformatics research. Based within the National Library of Medicine at the National Institutes of Health, USA, the NCBI hosts many databases used by biomedical and research professionals. The services include PubMed, the bibliographic database; GenBank, the nucleotide sequence database; and the BLAST algorithm for sequence comparison, among many others.

Although each NCBI resource has online help documentation associated with it, there is no cohesive approach to describing the databases and search engines, nor any significant information on how the databases work or how they can be leveraged, for bioinformatics research on a larger scale. The NCBI Handbook is designed to address this information gap.

An extraordinary resource for learning about bioinformatics information sources.

Pragmatic Philosophical Technology for Text Mining

Filed under: Subject Identity,Topic Maps — Patrick Durusau @ 6:16 pm

Pragmatic Philosophical Technology for Text Mining

Matthew Hurst writes:

In text mining applications, we often work with some form of raw input (web pages, web sites, emails, etc.) and attempt to organize it in terms of the concepts that are mentioned or introduced in the documents.

This process of interpretation can take the form of ‘normalization’ or ‘canonicalization’ (in which many expressions are associated with a singular expression as an exemplar of a set). This happens, for example, when we map ‘Barack Obama’, ‘President Obama’, etc. to a unique string ‘President Barack Obama’. This is convenient when we want to retrieve all documents about the president.

In this process, we are associating elements within the same language (language in the sense of sets of symbols and the rules that govern their legal generation).

Another approach is to map (or associate) the terms in the original document with some structured record. For example, we might interpret the phrase ‘Starbucks’ as relating to a record of key value pairs {name=starbucks, address=123 main street, …}. In this case, the structure of the record has a semantics (or model) other than that of the original document. In other words, we are mapping from one language to another.

Of course, what we want to do is denote the thing in the real world. It is, however, impossible to represent this, as all we can do is shuffle bits around inside the computer. We can’t attach a label to the real world and somehow transcend the reality/representation barrier. However, we can start to look at the modeling process with some pragmatics.
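
Both interpretations Hurst describes are easy to sketch side by side, with hypothetical names: an alias table mapping many surface forms to one exemplar (normalization), and a second table mapping exemplars to structured records (the mapping between languages):

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the two interpretation styles described above.
public class EntityInterpreter {
    // Normalization: many surface forms -> one exemplar string.
    private final Map<String, String> canonical = new HashMap<String, String>();
    // Structured interpretation: exemplar -> key/value record.
    private final Map<String, Map<String, String>> records =
            new HashMap<String, Map<String, String>>();

    public void alias(String surfaceForm, String exemplar) {
        canonical.put(surfaceForm, exemplar);
    }

    public void define(String exemplar, Map<String, String> record) {
        records.put(exemplar, record);
    }

    public String normalize(String mention) {
        String exemplar = canonical.get(mention);
        return exemplar != null ? exemplar : mention;
    }

    // Map from one "language" (strings) to another (records with their own model).
    public Map<String, String> interpret(String mention) {
        return records.get(normalize(mention));
    }
}

With alias(‘Barack Obama’, ‘President Barack Obama’) and alias(‘President Obama’, ‘President Barack Obama’), both mentions normalize to the same exemplar. Note that neither map ever touches the real-world referent, which is exactly Hurst’s point.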

Wrestling with subject identity issues. Worth your time to read and comment.

Introduction to Artificial Intelligence – Stanford Class

Filed under: Artificial Intelligence — Patrick Durusau @ 6:15 pm

Introduction to Artificial Intelligence – Stanford Class

Online enrollment ends September 10, 2011!

From the website:

A free, online version of “Introduction to Artificial Intelligence”, taught by Sebastian Thrun and Peter Norvig. A syllabus and more information about the Stanford course is available here.

How the class will work:
(video at Stanford site)

The class runs from Sept 26 through Dec 16, 2011. While this class is being offered online, it is also taught at Stanford University, where it continues to be a popular intro-level class on AI. For the online version, the instructors aim to offer identical materials, assignments, and exams, and to use the same grading criteria. Both instructors will be available for online discussions.

This looks like a lot of fun and you might also learn something!

Access to Artificial Intelligence: A Modern Approach is suggested.

Kindle version is about 29% off list price.

August 3, 2011

Design: Build the Mobile Gov Toolkit

Filed under: eGov,Government Data,Marketing,Mobile Gov — Patrick Durusau @ 7:39 pm

Design: Build the Mobile Gov Toolkit

Tim O’Reilly tweeted this link.

Deadline for comments: 2 September 2011

From the post:

Your recommendations will help build an open, dynamic toolset–on a public wiki–to help agencies create and implement citizen-centric mobile gov services.

We are focusing on five areas.

  1. Policies: Tell us about policy gaps or ideas to support building mobile programs.
  2. Practices: What would jumpstart your efforts? Templates? Standards? Examples? Can you share your templates, standards, business cases?
  3. Partnerships: With whom and how can we work together?
  4. Products: What are your ideas for apps, mobile sites, text programs, mashups?
  5. Promotions: What are some great ways to spread the word?
  6. Do you have another category? You can add that, too.

What should we tell them about topic maps?

Getting Started with Neo4j

Filed under: Graphs,Neo4j — Patrick Durusau @ 7:39 pm

Getting Started with Neo4j by Andreas Kollegger.

A presentation introducing Neo4j. If you haven’t looked at Neo4j yet, this is a good way to catch up.

I particularly liked the “buy you a beer” angle on getting people to come up later for further discussion.

The transition between slides and the command interface hangs a couple of times, but not too badly.

Overall a very good presentation.
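
If the slides leave you wanting to type something, the embedded “hello world” against the Neo4j 1.4 Java API of the time was about this small (a sketch; the store path is arbitrary):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;
import org.neo4j.kernel.EmbeddedGraphDatabase;

public class Neo4jHello {
    enum RelTypes implements RelationshipType { KNOWS }

    public static void main(String[] args) {
        GraphDatabaseService db = new EmbeddedGraphDatabase("target/neo4j-db");
        Transaction tx = db.beginTx();
        try {
            Node alice = db.createNode();
            alice.setProperty("name", "Alice");
            Node bob = db.createNode();
            bob.setProperty("name", "Bob");
            Relationship r = alice.createRelationshipTo(bob, RelTypes.KNOWS);
            r.setProperty("since", 2011);
            tx.success();
        } finally {
            tx.finish(); // commits on success(), rolls back otherwise
        }
        db.shutdown();
    }
}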

UK Government Paves Way for Data-Mining

Filed under: Authoring Topic Maps,Data Mining,Marketing — Patrick Durusau @ 7:37 pm

UK Government Paves Way for Data-Mining

A blog post on an interesting UK government policy report.

From the post:

The key recommendation is that the Government should press at EU level for the introduction of an exception to current copyright law, allowing “non-consumptive” use of a work (ie a use that doesn’t directly trade on the underlying creative and expressive purpose of the work). In the process of text-mining, copying is only carried out as part of the analysis process – it is a substitute for a human reading the work, and therefore does not compete with the normal exploitation of the work itself – in fact, as the paper says, these processes actually facilitate a work’s exploitation (ie by allowing search, or content recommendation). (emphasis in original)

If you think of topic maps as a value-add on top of information stores, allowing “non-consumptive” access would be a real boon for topic maps.

You could create a topic map into copyrighted material, and users of your topic map could access that material only if, say, they were subscribers to that content.

As Steve Newcomb has argued on many occasions, topic maps can become economic artifacts in their own right.

Optimizing Distributed Read Operations in VoltDB

Filed under: NoSQL,VoltDB — Patrick Durusau @ 7:37 pm

Optimizing Distributed Read Operations in VoltDB

From the post:

Many VoltDB applications, such as gaming leader boards and real-time analytics, use multi-partition procedures to compute consistent global aggregates (and other interesting statistics). It’s challenging to efficiently process distributed read operations, especially for performance-sensitive applications. Based on feedback from our users, we in VoltDB engineering have been enhancing the VoltDB SQL planner over the last few releases to improve this capability.

Executing global aggregates efficiently requires calculating sub-results at each partition replica and combining the sub-results at a coordinating partition to produce the final result. For example, to calculate a total sum, the VoltDB planner should produce a sub-total at each partition and then sum the sub-totals at the coordinator node. All of this work must be transparent to the application, of course.

Hmmm, “global aggregates,” doesn’t that sound familiar? I realize that here it means summing up the number of “kills,” “votes,” etc., simple number stuff, but in principle, what you return and how you sum it is, I would think, application specific. Yes?
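
On the “application specific” question: for distributive aggregates like SUM and COUNT the decomposition is mechanical, which is presumably what the planner automates. A hypothetical sketch of the shape of the work:

import java.util.Arrays;
import java.util.List;

// Sketch of the plan the post describes: each partition computes a
// sub-aggregate over its local rows; the coordinator combines them.
public class DistributedSum {
    // What each partition replica runs locally.
    static long partialSum(List<Long> localRows) {
        long sum = 0;
        for (long v : localRows) sum += v;
        return sum;
    }

    // What the coordinating partition runs over the sub-results.
    static long combine(List<Long> partials) {
        long total = 0;
        for (long p : partials) total += p;
        return total;
    }

    public static void main(String[] args) {
        List<Long> p1 = Arrays.asList(3L, 5L);  // partition 1's rows
        List<Long> p2 = Arrays.asList(7L, 11L); // partition 2's rows
        long total = combine(Arrays.asList(partialSum(p1), partialSum(p2)));
        System.out.println(total); // 26
    }
}

For SUM and COUNT the sub-result has the same type as the final result; for AVG each partition must return a (sum, count) pair; for medians and other holistic aggregates there is no small sub-result at all. That is where the application-specific worry really bites.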

Consistency or Bust: Breaking a Riak Cluster

Filed under: NoSQL,Riak — Patrick Durusau @ 7:36 pm

Consistency or Bust: Breaking a Riak Cluster by Jeff Kirkell.

Not your usual slidedeck.

Has enough examples and working instructions for you to actually learn something separate from the presentation.

Perhaps the one time you will be glad someone broke the rule about not putting text for the audience to read on a slide.

August 2, 2011

XBRL Challenge ($20K Prize)

Filed under: Funding,Marketing,Topic Maps,XBRL — Patrick Durusau @ 7:54 pm

XBRL Challenge ($20K Prize)

OK, I admit that after the US budget debate, $20K doesn’t sound like a lot of money. 😉 But, think of the prestige, groupies, etc., that would go along with winning first place.

From the website:

Over 1770 companies have already filed XBRL-formatted financial statements to the SEC and by year-end 2011, all public companies will be doing so. While several XBRL-enabled tools are available on the marketplace today, we’ve created the XBRL Challenge to encourage the development of more tools and build awareness among analysts about the wealth of data available to them.

The XBRL Challenge is a contest that invites participants to contribute open source analytical applications for investors that leverage corporate XBRL data.

Here is the short description of what they are looking for:

Tools that rely on XBRL data, e.g., tool that extracts data for multi-company comparison via desktop application; or one that creates real-time valuation measures and delivers to mobile devices.

I am going to check out the rules and existing apps.

See you near the winner’s circle?

Announcing the Digital Science Catalyst Prize

Filed under: Funding — Patrick Durusau @ 7:54 pm

Announcing the Digital Science Catalyst Prize

From the website:

Since we launched in December 2010, the Digital Science team has been hard at work not only crafting our own software solutions for science, but also engaging with and supporting a range of start-ups and innovators who are working to make research more effective. Today we’re launching a new initiative set to push that engagement and investment in technology one step further.

We’re thrilled to unveil the Catalyst Prize – a programme designed to unleash the most promising new ideas for software in science. This provides grants up to £15,000 (around $24,000) each to fund exciting innovations, and to take them from concept to prototype. They also come with the opportunity to work with the Digital Science team to help refine, develop and promote the innovation. In this way we hope to lower barriers and foster greater creativity in information-technology solutions for science.

The process is simple and fast. Applicants are asked to submit a short proposal detailing their idea and the impact they envision it having in scientific research. Those who pass an initial screening are then asked to present their idea, preferably in person at a Digital Science office (in London, New York or Tokyo), following which a final decision is made. Applications are open now and are accepted at any time.

Sounds like a plan to me!

BTW, saw this on a tweet by Tim O’Reilly.

Neo4j 1.4.1 “Kiruna Stol” GA

Filed under: Graphs,Neo4j — Patrick Durusau @ 7:53 pm

Neo4j 1.4.1 “Kiruna Stol” GA

From the website:

In the last few weeks since we announced Neo4j 1.4 GA, we’ve been busy working on improvements to the codebase for more predictability, better backup performance, and improved scripts for the server. Ordinarily we’d roll these improvements into a milestone, but this time around we think they’re important enough to warrant a stable release, and so today we’re announcing the release of Neo4j 1.4.1 GA.

International QSAR Foundation

Filed under: Cheminformatics,QSAR — Patrick Durusau @ 7:52 pm

International QSAR Foundation

From the website:

The International QSAR Foundation is the only nonprofit research organization devoted solely to creating alternative methods for identifying chemical hazards without further laboratory testing.

We develop, implement and support new QSAR technologies for use in regulation, research and education or wherever testing animals with chemicals is now required. QSAR models predict chemical behavior directly from chemical structure and simulate adverse effects in cells, tissues and lab animals.

When combined with other alternative test methods, QSAR can minimize the need for animal tests while improving safe use of drugs and other chemicals. (emphasis added)

Subject identification by predicted behavior anyone?

QSAR Toolbox

Filed under: Cheminformatics,QSAR — Patrick Durusau @ 7:52 pm

QSAR Toolbox

From the website:

The category approach used in the Toolbox:

  • Focuses on intrinsic properties of chemicals (mechanism or mode of action, (eco-)toxicological effects).
  • Allows for entire categories of chemicals to be assessed when only a few members are tested, saving costs and the need for testing on animals.
  • Enables robust hazard assessment through mechanistic comparisons without testing.

The QSAR Toolbox is software intended to be used by governments, the chemical industry and other stakeholders to fill gaps in (eco-)toxicity data needed for assessing the hazards of chemicals. The Toolbox incorporates information and tools from various sources into a logical workflow. Grouping chemicals into chemical categories is crucial to this workflow.

August 1, 2011

Open.NASA

Filed under: Astroinformatics,Data Source — Patrick Durusau @ 3:56 pm

Open.NASA

NASA has shared data and software for years but now has a shiny new website and, to be fair, some introductions that make use of the material easier.

I don’t have a citation for it, but Jim Gray (Microsoft) is reported to have said that astronomy data was great because there was so much of it and it was free.

There is a lot of mapping possible twixt and tween astronomy data sets, both historic and recent, so it is a ripe area for exploration with topic maps.


Update:

NASA’s Open Government Site Built On Open Source, an InformationWeek post on the NASA site.

Why InformationWeek mentions Object Oriented Data Technology (OODT) and Disqus but provides no links to the same, I cannot say.

Admittedly I don’t do enough linking for concepts, etc., but I do try to put in links to projects and the like.

99 Problems, But the Search Ain’t One

Filed under: ElasticSearch,Search Engines,Searching — Patrick Durusau @ 3:55 pm

99 Problems, But the Search Ain’t One

A fairly comprehensive overview of elasticsearch, including replication/sharding and API summaries.

Depending on the type of search and “aggregation” (read merging) you require, this may fit the bill.

Neo4j Spatial

Filed under: Geographic Data,Neo4j — Patrick Durusau @ 3:54 pm

Neo4j Spatial – GIS for the rest of us by Peter Neubauer.

Impressive demonstration of the power of Neo4j!

Watch the slide deck and then see: Neo4j Spatial for more details.

And then, Neo4j Spatial Blog Ideas.

Now couple that with the idea of merging other data sources, say traffic, fire, or other public reports, current or historical, onto a map. The potential to create, or possibly prevent, disruption of services seems unlimited.

Pig with Cassandra: Adventures in Analytics

Filed under: Cassandra,Pig,Pygmalion — Patrick Durusau @ 3:54 pm

Pig with Cassandra: Adventures in Analytics

Suggestions for slide 6 that reads in part:

Pygmalion

Figure in Greek Mythology, sounds like Pig

True enough, but in terms of a control language, the play Pygmalion by Shaw would have been the better reference.

I presume the reader/listener would get the sound similarity without prompting.

Sorry, read the slide deck and see the source code at: https://github.com/jeromatron/pygmalion/.

STINGER: Spatio-Temporal Interaction Networks and Graphs (STING) Extensible Representation

Filed under: Graphs,Parallel Programming,STINGER — Patrick Durusau @ 3:52 pm

STINGER: Spatio-Temporal Interaction Networks and Graphs (STING) Extensible Representation by David A. Bader, Georgia Institute of Technology; Jonathan Berry, Sandia National Laboratories; Adam Amos-Binks, Carleton University, Canada; Daniel Chavarría-Miranda, Pacific Northwest National Laboratory; Charles Hastings, Hayden Software Consulting, Inc.; Kamesh Madduri, Lawrence Berkeley National Laboratory; and Steven C. Poulos, U.S. Department of Defense. Dated May 9, 2009.

Abstract:

In this document, we propose a dynamic graph data structure that can serve as a common data structure for multiple real-world applications. The extensible representation for dynamic complex networks is space-efficient, allows parallelism over vertices and edges independently, and can be used for efficient checkpoint/restart of the data.

Describes a deeply interesting data structure for graphs that can be used across different frameworks.

See the Stinger wiki page (with source code as attachments).

And, see D. Ediger, K. Jiang, J. Riedy, and D.A. Bader, “Massive Streaming Data Analytics: A Case Study with Clustering Coefficients,” 4th Workshop on Multithreaded Architectures and Applications (MTAAP), Atlanta, GA, April 23, 2010.

Abstract:

We present a new approach for parallel massive graph analysis of streaming, temporal data with a dynamic and extensible representation. Handling the constant stream of new data from health care, security, business, and social network applications requires new algorithms and data structures. We examine data structure and algorithm trade-offs that extract the parallelism necessary for high-performance updating analysis of massive graphs. Static analysis kernels often rely on storing input data in a specific structure. Maintaining these structures for each possible kernel with high data rates incurs a significant performance cost. A case study computing clustering coefficients on a general-purpose data structure demonstrates incremental updates can be more efficient than global recomputation. Within this kernel, we compare three methods for dynamically updating local clustering coefficients: a brute-force local recalculation, a sorting algorithm, and our new approximation method using a Bloom filter. On 32 processors of a Cray XMT with a synthetic scale-free graph of 2^24 ≈ 16 million vertices and 2^29 ≈ 537 million edges, the brute-force method processes a mean of over 50,000 updates per second and our Bloom filter approaches 200,000 updates per second.

The authors refer to their approach as “massive streaming data analytics”. I think you will agree.

OK, admittedly they used a Cray XMT. But such processing power will be available to the average site sooner than you think. Soon enough that reading along these lines will put you ahead of the next curve.
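
The incremental idea itself is simple enough to sketch exactly, without the Bloom filter: when an edge (u,v) arrives, the only new triangles are those closed through a common neighbor of u and v, so per-vertex triangle counts, and with them local clustering coefficients, can be updated in place. A hypothetical single-threaded Java sketch:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of incremental triangle counting: on edge insertion, update
// counts via common neighbors instead of recomputing globally.
public class IncrementalTriangles {
    private final Map<Integer, Set<Integer>> adj = new HashMap<Integer, Set<Integer>>();
    private final Map<Integer, Integer> triangles = new HashMap<Integer, Integer>();

    public void addEdge(int u, int v) {
        if (u == v || neighbors(u).contains(v)) return; // ignore self-loops, duplicates
        for (int w : neighbors(u)) {
            if (neighbors(v).contains(w)) { // common neighbor => one new triangle
                bump(u); bump(v); bump(w);
            }
        }
        neighbors(u).add(v);
        neighbors(v).add(u);
    }

    private Set<Integer> neighbors(int x) {
        Set<Integer> s = adj.get(x);
        if (s == null) { s = new HashSet<Integer>(); adj.put(x, s); }
        return s;
    }

    private void bump(int x) {
        Integer t = triangles.get(x);
        triangles.put(x, t == null ? 1 : t + 1);
    }

    // Local clustering coefficient: triangles(x) / (deg(x) choose 2).
    public double clusteringCoefficient(int x) {
        int d = neighbors(x).size();
        if (d < 2) return 0.0;
        Integer t = triangles.get(x);
        return (t == null ? 0 : t) / (d * (d - 1) / 2.0);
    }
}

The paper’s Bloom filter variant replaces the exact neighbor-set intersection with an approximate membership test, trading a little accuracy for much higher update rates.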

UnQL

Filed under: JSON,Query Language — Patrick Durusau @ 3:51 pm

UnQL

From the webpage:

UnQL means Unstructured Query Language. It’s an open query language for JSON, semi-structured and document databases.

Another query language. Thoughts?

Christos Faloutsos: Mining Billion-Node Graphs

Filed under: Graphs — Patrick Durusau @ 3:51 pm

Christos Faloutsos: Mining Billion-Node Graphs

From Daniel Tunkelang at The Noisy Channel:

As promised, here is a video of CMU professor Christos Faloutsos’s recent tech talk at LinkedIn on “Mining Billion-Node Graphs”. Enjoy!

And check out next week’s open tech talk by Sreenivas Gollapudi of Microsoft Research on “A Framework for Result Diversification in Search”.

TinkerPop – New Releases

Filed under: Blueprints,Frames,Graphs,Gremlin,Pipes,Rexster,TinkerPop — Patrick Durusau @ 3:51 pm

Good news from Marko Rodriguez:

TinkerPop just released a new round of stable releases.

Blueprints 0.9 (Mavin) – https://github.com/tinkerpop/blueprints/wiki/Release-Notes

Pipes 0.7 (PVC) – https://github.com/tinkerpop/pipes/wiki/Release-Notes

Frames 0.4 (Studs) – https://github.com/tinkerpop/frames/wiki/Release-Notes

Gremlin 1.2 (New Sheriff in Town) – https://github.com/tinkerpop/gremlin/wiki/Release-Notes

Rexster 0.5 (Dog Star) – https://github.com/tinkerpop/rexster/wiki/Release-Notes

Here are the main points of each release:

  • Blueprints:
    • Vertex API changed so now you have Vertex.getInEdges(String… labels) and Vertex.getOutEdges(String… labels)
    • Heavy development on GraphSail, which turns any IndexableGraph into a Sail RDF store
  • Pipes:
    • Introduced PipeClosure pattern which allows for closure-based pipes in native Java
    • Migrated all “Gremlin-specific pipes” (closure-based) to Pipes
    • Opening up the stage for data flow traversal languages for any JVM language
  • Frames:
    • Added helper interfaces VertexFrame and EdgeFrame
  • Gremlin:
    • Support the easy definition of new steps with Gremlin.defineStep()
    • Mass migration of all “Gremlin-specific pipes” to Pipes
    • Support for processing closures in aggregate, groupCount, and paths
  • Rexster:
    • Added RexPro (the future foundation for Rexster’s multi-protocol infrastructure)
    • Added rexster-console.sh (RexsterConsole) to allow remote “mysql>”-style interactions via any JSR 223-based JVM language
    • JSON serialization inherited from Blueprints (consistent throughout TinkerPop stack)
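
To see the Blueprints change from the first bullet in context, here is a minimal sketch against the in-memory TinkerGraph (package and method names assumed from the pre-2.0 Blueprints line):

import com.tinkerpop.blueprints.pgm.Edge;
import com.tinkerpop.blueprints.pgm.Graph;
import com.tinkerpop.blueprints.pgm.Vertex;
import com.tinkerpop.blueprints.pgm.impls.tinkergraph.TinkerGraph;

public class BlueprintsExample {
    public static void main(String[] args) {
        Graph graph = new TinkerGraph();      // in-memory reference implementation
        Vertex marko = graph.addVertex(null); // null => let the graph assign an id
        marko.setProperty("name", "marko");
        Vertex gremlin = graph.addVertex(null);
        gremlin.setProperty("name", "gremlin");
        graph.addEdge(null, marko, gremlin, "created");

        // The 0.9 change: filter incident edges by label directly.
        for (Edge e : marko.getOutEdges("created")) {
            System.out.println(e.getInVertex().getProperty("name"));
        }
        graph.shutdown();
    }
}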

OrientDB v1.0rc4

Filed under: Graphs,OrientDB — Patrick Durusau @ 3:50 pm

OrientDB v1.0rc4

In case you haven’t read about OrientDB before:

OrientDB is a new open source NoSQL DBMS born with the best features of all the others. It’s written in Java and it’s amazingly fast: it can store up to 150,000 records per second on common hardware. Even though it’s a document-based database, relationships are managed as in graph databases, with direct connections among records. You can traverse entire trees and graphs of records, or parts of them, in a few milliseconds. It supports schema-less, schema-full and schema-mixed modes, has a strong security profiling system based on users and roles, and supports SQL among its query languages. Thanks to the SQL layer, it’s straightforward to use for people skilled in the relational world.

The list of latest changes.

From the latest announcement:

Please help OrientDB become more widely known by writing a short review in your blog, magazines and mailing lists. The magic formula is: More users = More tests = More stability = More support (drivers, plugins, etc).

That’s clear enough! 😉
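
In the spirit of helping out, here is roughly what the document API looked like in the 1.0rc line, SQL layer included (a sketch reconstructed from the era’s examples; treat the class names and the local: URL scheme as assumptions):

import java.util.List;
import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.record.impl.ODocument;
import com.orientechnologies.orient.core.sql.query.OSQLSynchQuery;

public class OrientDBExample {
    public static void main(String[] args) {
        // Create (or open) a local document database.
        ODatabaseDocumentTx db = new ODatabaseDocumentTx("local:/tmp/demo").create();
        try {
            ODocument doc = new ODocument("Person");
            doc.field("name", "Luke");
            doc.field("city", "Rome");
            doc.save();

            // The SQL layer the announcement mentions.
            List<ODocument> result = db.query(
                    new OSQLSynchQuery<ODocument>("select from Person where name = 'Luke'"));
            for (ODocument d : result) {
                System.out.println(d.field("name") + " lives in " + d.field("city"));
            }
        } finally {
            db.close();
        }
    }
}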

Create Hadoop clusters the easy peasy way with Pallet

Filed under: Hadoop — Patrick Durusau @ 3:50 pm

Create Hadoop clusters the easy peasy way with Pallet

From the post:

Setting up a Hadoop cluster is usually a pretty involved task. There are certain rules about how the cluster is to be configured. These rules need to be followed strictly for the cluster to work. For example, some nodes need to know how to talk to the other nodes, and some nodes need to allow other nodes to talk to them. Go ahead and check out the official instructions, or this more detailed tutorial on setting up multi-node Hadoop clusters. In this article we describe a solution that will create a fully functional hadoop cluster on any public cloud with very few steps, and in a very flexible way.

Anyone with a cloud account have any comments on this approach to creating Hadoop clusters?

International Bibliographic Standards, Linked Data, and the Impact on Library Cataloging

Filed under: Conferences,FRBR,Linked Data — Patrick Durusau @ 3:49 pm

International Bibliographic Standards, Linked Data, and the Impact on Library Cataloging

Webinar
August 24, 2011
1:00 – 2:30 p.m. (Eastern Time)

From the notice:

The International Federation of Library Associations and Institutions (IFLA) is responsible for the development and maintenance of International Standard Bibliographic Description (ISBD), UNIMARC, and the “Functional Requirements” family for bibliographic records (FRBR), authority data (FRAD), and subject authority data (FRSAD). ISBD underpins the MARC family of formats used by libraries world-wide for many millions of catalog records, while FRBR is a relatively new model optimized for users and the digital environment. These metadata models, schemas, and content rules are now being expressed in the Resource Description Framework language for use in the Semantic Web.

This webinar provides a general update on the work being undertaken. It describes the development of an Application Profile for ISBD to specify the sequence, repeatability, and mandatory status of its elements. It discusses issues involved in deriving linked data from legacy catalogue records based on monolithic and multi-part schemas following ISBD and FRBR, such as the duplication which arises from copy cataloging and FRBRization. The webinar provides practical examples of deriving high-quality linked data from the vast numbers of records created by libraries, and demonstrates how a shift of focus from records to linked-data triples can provide more efficient and effective user-centered resource discovery services.

This is not a free webinar, but registration means that if you miss it on the 24th of August, you will still have access to the recorded proceedings for one year.
