Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

February 11, 2013

unicodex — High-performance Unicode Library (C++)

Filed under: Software,Unicode — Patrick Durusau @ 11:42 am

unicodex — High-performance Unicode Library (C++) by Dustin Juliano.

From the post:

The following is a micro-optimized Unicode encoder/decoder for C++ that is capable of significant performance, sustaining 6 GiB/s for UTF-8 to UTF-16/32 on an AMD A8-3870 running in a single thread, and 8 GiB/s for UTF-16 to UTF-32. That would allow it to encode nearly the full English Wikipedia in approximately 6 seconds.

It maps between UTF-8, UTF-16, and UTF-32, and properly detects UTF-8 BOM and the UTF-16 BOMs. It has been unit tested with gigabytes of data and verified with binary analysis tools. Presently, only little-endian is supported, which should not pose any significant limitations on use. It is released under the BSD license, and can be used in both proprietary and free software projects.

The decoder is aware of malformed input and will raise an exception if the input sequence would cause a buffer overflow or is otherwise fatally incorrect. It does not, however, ensure that exact codepoints correspond to the specific Unicode planes; this is by design. The implementation has been designed to be robust against garbage input and specifically avoid encoding attacks.
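The library itself is C++ and its exact API isn’t shown in the post, so purely to illustrate what the BOM detection it mentions involves, here is a minimal Java sketch (hypothetical class name, not the unicodex interface):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Report the charset implied by a leading byte order mark, or null if none. */
final class BomSniffer {
    static Charset detect(byte[] b) {
        // UTF-32 BOMs must be tested before UTF-16: FF FE 00 00 starts with FF FE.
        if (b.length >= 4 && b[0] == (byte) 0xFF && b[1] == (byte) 0xFE && b[2] == 0 && b[3] == 0) {
            return Charset.forName("UTF-32LE");
        }
        if (b.length >= 4 && b[0] == 0 && b[1] == 0 && b[2] == (byte) 0xFE && b[3] == (byte) 0xFF) {
            return Charset.forName("UTF-32BE");
        }
        if (b.length >= 3 && b[0] == (byte) 0xEF && b[1] == (byte) 0xBB && b[2] == (byte) 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (b.length >= 2 && b[0] == (byte) 0xFF && b[1] == (byte) 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        if (b.length >= 2 && b[0] == (byte) 0xFE && b[1] == (byte) 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        return null; // no BOM: the caller falls back to a default or other heuristics
    }
}
```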

One of those “practical” things that you may need for processing topic maps and/or other digital information. 😉

January 29, 2013

The Data Science Toolkit is now on Vagrant!

Filed under: Data Science,Software,Vagrant — Patrick Durusau @ 6:51 pm

The Data Science Toolkit is now on Vagrant! by Pete Warden.

From the post:

I have fallen in love with Vagrant over the last year; it treats an entire logical computer as a single unit of software. In simple terms, you can easily set up, run, and maintain a virtual machine image with all the frameworks and data dependencies pre-installed. You can wipe it, copy it to a different system, branch it to run experimental changes, keep multiple versions around, easily share it with other people, and quickly deploy multiple copies when you need to scale up. It’s as revolutionary as the introduction of distributed source control systems; you’re suddenly free to innovate because mistakes can be painlessly rolled back, and you can collaborate with other people without worrying that anything will be overwritten.

Before I discovered Vagrant, I’d attempted to do something similar with my Data Science Toolkit package, distributing a VMware image of a full linux system with all the software and data it required pre-installed. It was a large download, and a lot of people used it, but the setup took more work than I liked. Vagrant solved a lot of the usability problems around downloading VMs, so I’ve been eager to create a compatible version of the DSTK image. I finally had a chance to get that working over the weekend, so you can create your own local geocoding server just by running:

vagrant box add dstk http://static.datasciencetoolkit.org/dstk_0.41.box

vagrant init

The box itself is almost 5GB with all the address data, so the download may take a while. Once it’s done go to http://localhost:8080 and you’ll see the web interface to the geocoding and unstructured data parsing functions.

Based on Oracle’s VirtualBox, this looks like a very cool way to distribute topic map applications with data.

Remember the Emulate Drug Dealers [Marketing Topic Maps] post?

I was very serious.

January 24, 2013

11 Interesting Releases From the First Weeks of January

Filed under: NoSQL,Software — Patrick Durusau @ 8:10 pm

11 Interesting Releases From the First Weeks of January by Alex Popescu.

Alex has collected links for eleven (11) interesting NoSQL releases in January 2013!

Visit Alex’s post. You won’t be disappointed.

January 23, 2013

Testling-CI

Filed under: Interface Research/Design,Software,Web Applications,Web Browser — Patrick Durusau @ 7:41 pm

Announcing Testling-CI by Peteris Krumins.

From the post:

We at Browserling are proud to announce Testling-CI! Testling-CI lets you write continuous integration cross-browser tests that run on every git push!

testling-ci

There are a ton of modules on npm and github that aren’t just for node.js but for browsers, too. However, figuring out which browsers these modules work with can be tricky. It’s often the case that some module used to work in browsers but has accidentally stopped working because the developer hadn’t checked that their code still worked recently enough. If you use npm for frontend and backend modules, this can be particularly frustrating.

You will probably also be interested in: How to write Testling-CI tests.

A bit practical for me but with HTML5, browser-based interfaces are likely to become the default.

Useful to point out resources that will make it easier to cross-browser test browser-based topic map interfaces.

January 12, 2013

13 Things People Hate about Your Open Source Docs

Filed under: Documentation,Open Source,Software — Patrick Durusau @ 7:06 pm

13 Things People Hate about Your Open Source Docs by Andy Lester.

From the post:

1. Lacking a good README or introduction

2. Docs not available online

3. Docs only available online

4. Docs not installed with the package

5. Lack of screenshots

6. Lack of realistic examples

7. Inadequate links and references

8. Forgetting the new user

9. Not listening to the users

10. Not accepting user input

11. No way to see what the software does without installing it

12. Relying on technology to do your writing

13. Arrogance and hostility toward the user

See Andy’s post for the details and suggestions on ways to improve.

Definitely worth a close read!

January 9, 2013

NewGenLib Open Source…Update! [Library software]

Filed under: Library,Library software,OPACS,Software — Patrick Durusau @ 12:00 pm

NewGenLib Open Source releases version 3.0.4 R1 Update 1

From the blog:

The NewGenLib Open Source has announced the release of a new version 3.0.4 R1 Update 1. NewGenLib is an integrated library management system developed by Verus Solutions in conjunction with Kesaran Institute of Information and Knowledge Management in India. The software has the modules acquisitions, technical processing, serials management, circulation, administration, and MIS reports and OPAC.

What’s new in the Update?

This new update comes with a basket of additional features and enhancements, these include:

  • Full text indexing and searching of digital attachments: NewGenLib now uses Apache Tika. With this new tool not only catalogue records but their digital attachments and URLs are indexed. Now you can also search based on the content of your digital attachments
  • Web statistics: The software facilitates the generation of statistics on OPAC usage by having an allowance for Google Analytics code.
  • User ratings of Catalogue Records: An enhancement for User reviews is provided in OPAC. Users can now rate a catalogue record on a scale of 5 (Most useful to not useful). Also, one level of approval is added for User reviews and ratings. 
  • Circulation history download: Users can now download their Circulation history as a PDF file in OPAC

NewGenLib supports MARC 21 bibliographic data, MARC authority files, Z39.50 Client for federated searching. Bibliographic records can be exported in MODS 3.0 and AGRIS AP. The software is OAI-PMH compliant. NewGenLib has a user community with an online discussion forum.

If you are looking for potential topic map markets, the country population rank graphic from Wikipedia may help:
World Population Graph

Population isn’t everything but it should not be ignored either.

December 30, 2012

When is “Hello World,” Not “Hello World?”

Filed under: Graphs,MongoDB,Neo4j,Software — Patrick Durusau @ 8:43 pm

To answer that question, you need to see the post: Travel NoSQL Application – Polyglot NoSQL with SpringData on Neo4J and MongoDB.

Just a quick sample:

In this Fuse day, Tikal Java group decided to continue its previous Fuse research for NoSQL, but this time from a different point of view – SpringData and Polyglot persistence. We had two goals in this Fuse day: try working with more than one NoSQL in the same application, and also taking advantage of SpringData data access abstractions for NoSQL databases. We decided to take MongoDB as document DB and Neo4J as graph database and put them behind an existing, classic and well known application – the Spring Travel Sample application.

More than the usual “Hello World” example for languages and a bit more than for most applications.

It would be a nice trend to see more robust, perhaps “Hello World+” examples.

What is your enhanced “Hello World+” going to look like in 2013?

December 21, 2012

<ANGLES>

Filed under: Editor,Software,XML — Patrick Durusau @ 10:33 am

<ANGLES>

From the homepage:

ANGLES is a research project aimed at developing a lightweight, online XML editor tuned to the needs of the scholarly text encoding community. By combining the model of intensive code development (the “code sprint”) with participatory design exercises, testing, and feedback from domain experts gathered at disciplinary conferences, ANGLES will contribute not only a working prototype of a new software tool but also another model for tool building in the digital humanities (the “community roadshow”).

Work on ANGLES began in November 2012.

We’ll have something to share very soon!

<ANGLES> is an extension of ACE:

ACE is an embeddable code editor written in JavaScript. It matches the features and performance of native editors such as Sublime, Vim and TextMate. It can be easily embedded in any web page and JavaScript application. ACE is maintained as the primary editor for Cloud9 IDE and is the successor of the Mozilla Skywriter (Bespin) project.

<ANGLES> code at Sourceforge.

I will be interested to see how ACE is extended. Just glancing at it this morning, it appears to be the traditional “display angle bang syntax” editor we all know so well.

What puzzles me is that we have been to the mountain of teaching users to be comfortable with raw XML markup and the results have not been promising.

As opposed to the experience with OpenOffice, MS Office, etc., which have proven that creating documents that are then expressed in XML is within the range of ordinary users.

<ANGLES> looks like an interesting project but whether it brings XML editing within the reach of ordinary users is an open question.

If the XML editing puzzle is solved, perhaps it will have lessons for topic map editors.

December 6, 2012

Tails: The Amnesic Incognito Live System [Data Mining Where You Shouldn’t]

Filed under: Security,Software — Patrick Durusau @ 11:42 am

Tails: The Amnesic Incognito Live System

From the webpage:

Privacy for anyone anywhere

Tails is a live DVD or live USB that aims at preserving your privacy and anonymity.

It helps you to:

  • use the Internet anonymously almost anywhere you go and on any computer: all connections to the Internet are forced to go through the Tor network;
  • leave no trace on the computer you’re using unless you ask it explicitly;
  • use state-of-the-art cryptographic tools to encrypt your files, email and instant messaging.

If you go data mining where you are unwanted, don’t use your regular user name and real address.

In fact, something like Tails might be in order.

Being mindful that possession of a USB stick with Tails on it could be considered a breach of security, should someone choose to take it that way.

Probably best to use a DVD disguised as a Lady Gaga disk. 😉

PS: Being mindful there is always the old fashioned hostile data mining, steal the drives: Swiss Spy Agency: Counter-Terrorism Secrets Stolen.

November 24, 2012

Consistency through semantics

Filed under: Consistency,Semantics,Software — Patrick Durusau @ 2:13 pm

Consistency through semantics by Oliver Kennedy.

From the post:

When designing a distributed system, one of the first questions anyone asks is what kind of consistency model to use. This is a fairly nuanced question, as there isn’t really one right answer. Do you enforce strong consistency and accept the resulting latency and communication overhead? Do you use locking, and accept the resulting throughput limitations? Or do you just give up and use eventual consistency and accept that sometimes you’ll end up with results that are just a little bit out of sync?

It’s this last bit that I’d like to chat about today, because it’s actually quite common in a large number of applications. This model is present in everything from user-facing applications like Dropbox to SVN/GIT, to back-end infrastructure systems like Amazon’s Dynamo and Yahoo’s PNUTs. Often, especially in non-critical applications, latency and throughput are more important than dealing with the possibility that two simultaneous updates will conflict.

So what happens when this dreadful possibility does come to pass? Clearly the system can’t grind to a halt, and often just randomly discarding one of these updates is the wrong thing to do. So what happens? The answer is common across most of these systems: They punt to the user.

Intuitively, this is the right thing to do. The user sees the big picture. The user knows best how to combine these operations. The user knows what to do, so on those rare occurrences where the system can’t handle it, the user can.

But why is this the right thing to do? What does the user have that the infrastructure doesn’t?

Take the time to read the rest of Oliver’s post.

He distinguishes rather nicely between applications and users.
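To make the “punt to the user” idea concrete, here is a minimal Java sketch (Java 16+ records; the names are invented and not from Oliver’s post): a register accepts an update only if the writer saw the latest version, and otherwise refuses to pick a winner, leaving reconciliation to the user.

```java
import java.util.Optional;

/** An update tagged with the version of the value it was based on. */
record Update(long baseVersion, String value) {}

/** A register that punts conflicting concurrent writes back to the user. */
final class Register {
    private long version = 0;
    private String value = "";

    synchronized Optional<String> apply(Update u) {
        if (u.baseVersion() == version) {   // the writer saw the latest value
            version++;
            value = u.value();
            return Optional.of(value);
        }
        return Optional.empty();            // conflict: let the user reconcile
    }
}
```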

October 23, 2012

Up to Date on Open Source Analytics

Filed under: Analytics,MySQL,PostgreSQL,R,Software — Patrick Durusau @ 8:17 am

Up to Date on Open Source Analytics by Steve Miller.

Steve updates his Wintel laptop with the latest releases of open source analytics tools.

Steve’s list:

What’s on your list?

I first saw this mentioned at KDNuggets.

October 15, 2012

People and Process > Prescription and Technology

Filed under: Project Management,Semantic Diversity,Semantics,Software — Patrick Durusau @ 3:55 pm

Factors that affect software systems development project outcomes: A survey of research by Laurie McLeod and Stephen G. MacDonell. ACM Computing Surveys (CSUR) Surveys Volume 43 Issue 4, October 2011 Article No. 24, DOI: 10.1145/1978802.1978803.

Abstract:

Determining the factors that have an influence on software systems development and deployment project outcomes has been the focus of extensive and ongoing research for more than 30 years. We provide here a survey of the research literature that has addressed this topic in the period 1996–2006, with a particular focus on empirical analyses. On the basis of this survey we present a new classification framework that represents an abstracted and synthesized view of the types of factors that have been asserted as influencing project outcomes.

As with most survey work, particularly ones that summarize 177 papers, this is a long article, some fifty-six pages.

Let me try to tempt you into reading it by quoting from Angelica de Antonio’s review of it (in Computing Reviews, Oct. 2012):

An interesting discussion about the very concept of project outcome precedes the survey of factors, and an even more interesting discussion follows it. The authors stress the importance of institutional context in which the development project takes place (an aspect almost neglected in early research) and the increasing evidence that people and process have a greater effect on project outcomes than technology. A final reflection on why projects still continue to fail—even if we seem to know the factors that lead to success—raises a question on the utility of prescriptive factor-based research and leads to considerations that could inspire future research. (emphasis added)

Before you run off to the library or download a copy of the survey, two thoughts to keep in mind:

First, if “people and process” are more important than technology, where should we place the emphasis in projects involving semantics?

Second, if “prescription” can’t cure project failure, what are its chances with semantic diversity?

Thoughts?

September 9, 2012

Best Open Source[?]

Filed under: Open Source,Software — Patrick Durusau @ 3:16 pm

Best Open Source

Are you familiar with this open source project listing site?

I ask because I encountered it today and while it looks interesting, I have the following concerns:

  • Entries are not dated (at least that I can find). Undated entries are not quite useless but nearly so.
  • Entries are not credited (no authors cited). Another strike against the entries.
  • Rating (basis for) isn’t clear.

It looks suspicious but it could be poor design.

Comments/suggestions?

September 8, 2012

Software fences

Filed under: Knowledge,Software — Patrick Durusau @ 10:07 am

Software fences by John D. Cook.

A great quote from G. K. Chesterton.

Do reformers of every generation think their forefathers were fools or do reformers have a mistaken belief in “progress?”

Rather than saying “progress,” what if we say we know things “differently” than our forefathers?

Not better or worse, just differently.

July 30, 2012

Chaos Monkey released into the wild

Filed under: Software,Systems Administration — Patrick Durusau @ 6:08 pm

Chaos Monkey released into the wild by Cory Bennett and Ariel Tseitlin.

From the post:

We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient. We are excited to make a long-awaited announcement today that will help others who embrace this approach.

We have written about our Simian Army in the past and we are now proud to announce that the source code for the founding member of the Simian Army, Chaos Monkey, is available to the community.

Do you think your applications can handle a troop of mischievous monkeys loose in your infrastructure? Now you can find out.

What is Chaos Monkey?

Chaos Monkey is a service which runs in the Amazon Web Services (AWS) that seeks out Auto Scaling Groups (ASGs) and terminates instances (virtual machines) per group. The software design is flexible enough to work with other cloud providers or instance groupings and can be enhanced to add that support. The service has a configurable schedule that, by default, runs on non-holiday weekdays between 9am and 3pm. In most cases, we have designed our applications to continue working when an instance goes offline, but in those special cases that they don’t, we want to make sure there are people around to resolve and learn from any problems. With this in mind, Chaos Monkey only runs within a limited set of hours with the intent that engineers will be alert and able to respond.
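A minimal sketch of the behavior described above: pick one instance per group at random and terminate it, only inside the working-hours window. The CloudApi interface here is a made-up stand-in rather than the real AWS SDK, and the holiday check is omitted.

```java
import java.time.DayOfWeek;
import java.time.LocalDateTime;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical stand-in for a cloud provider API; not the real AWS SDK. */
interface CloudApi {
    Map<String, List<String>> instancesByGroup();   // group name -> instance ids
    void terminate(String instanceId);
}

final class ChaosMonkeySketch {
    private final CloudApi cloud;
    private final Random random = new Random();

    ChaosMonkeySketch(CloudApi cloud) { this.cloud = cloud; }

    /** Weekdays, 9am to 3pm, as the post describes (holidays ignored here). */
    boolean withinSchedule(LocalDateTime now) {
        DayOfWeek d = now.getDayOfWeek();
        boolean weekday = d != DayOfWeek.SATURDAY && d != DayOfWeek.SUNDAY;
        return weekday && now.getHour() >= 9 && now.getHour() < 15;
    }

    /** Terminate one randomly chosen instance in each group. */
    void unleash() {
        if (!withinSchedule(LocalDateTime.now())) return;
        cloud.instancesByGroup().forEach((group, instances) -> {
            if (!instances.isEmpty()) {
                cloud.terminate(instances.get(random.nextInt(instances.size())));
            }
        });
    }
}
```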

At first I was unsure whether Netflix was hoping its competitors would run Chaos Monkey or whether it really runs it internally. 😉

It certainly is a way to test your infrastructure. And quite possibly a selling point to clients who want more than projected or historical robustness.

Makes me curious, allowing for different infrastructures, how would you stress test a topic map installation?

And do so on a regular basis?

I first saw this at Alex Popescu’s myNoSQL.

July 29, 2012

OSCON 2012

OSCON 2012

Over 4,000 photographs were taken at the MS booth.

I wonder how many of them include Doug?

Drop by the OSCON website after you count photos of Doug.

Your efforts at topic mapping will improve from the experience of the OSCON site visit.

What you get from counting photos of Doug is unknown. 😉

July 18, 2012

2013 FOSE Call for Presentations

Filed under: Conferences,Government,Software — Patrick Durusau @ 3:55 pm

2013 FOSE Call for Presentations

From the webpage:

The FOSE Team welcomes presentation proposals that provide meaningful, actionable insights about technology development for government IT decision makers. We are looking for presentations that detail use-case studies, lessons learned, or emerging trends that improve operational efficiency and ignite innovation within and across government agencies. We are also specifically seeking Local, Federal and State Government Employees with stories to tell about their IT experiences and successes.

It’s a vendor show so prepare accordingly.

Lots of swag, hire booth help at the local modeling agency, etc.

You can’t make a sale if you don’t get their attention.

Deadline for submissions: September 14, 2012.

Topic map based solutions should make a good showing against traditional ETL (Extra Tax and Labor) solutions.

No charge for using the expansion of ETL (it probably isn’t even original but, if not, I don’t remember the source).

June 23, 2012

Elements of Software Construction [MIT 6.005]

Filed under: Software,Subject Identity,Topic Maps — Patrick Durusau @ 6:59 pm

Elements of Software Construction

Description:

This course introduces fundamental principles and techniques of software development. Students learn how to write software that is safe from bugs, easy to understand, and ready for change.

Topics include specifications and invariants; testing, test-case generation, and coverage; state machines; abstract data types and representation independence; design patterns for object-oriented programming; concurrent programming, including message passing and shared concurrency, and defending against races and deadlock; and functional programming with immutable data and higher-order functions.

From the MIT OpenCourseware site.

Of interest to anyone writing topic map software.

It should also be of interest to anyone evaluating how software shapes what subjects we can talk about and how we can talk about them. Data structures have the same implications.

Not necessary to undertake such investigations in all cases. There are many routine uses for common topic map software.

Being able to see when the edges of a domain don’t quite fit, or when there may be gaps in coverage for an information system, is a necessary skill for non-routine cases.

January 22, 2012

NGINX: The Faster Web Server Alternative

Filed under: Software,Web Server — Patrick Durusau @ 7:31 pm

NGINX: The Faster Web Server Alternative by Steven J. Vaughan-Nichols.

From the post:

Picking a Web server used to be easy. If you ran a Windows shop, you used Internet Information Server (IIS); if you didn’t, you used Apache. No fuss. No muss. Now, though, you have more Web server choices, and far more decisions to make. One of the leading alternatives, the open-source NGINX, is now the number two Web server in the world, according to Netcraft, the Web server analytics company.

NGINX (pronounced “engine X”) is an open-source HTTP Web server that also includes mail services with an Internet Message Access Protocol (IMAP) and Post Office Protocol (POP) server. NGINX is ready to be used as a reverse proxy, too. In this mode NGINX is used to load balance among back-end servers, or to provide caching for a slower back-end server.

Companies like the online TV video on demand company Hulu use NGINX for its stability and simple configuration. Other users, such as Facebook and WordPress.com, use it because the web server’s asynchronous architecture gives it a small memory footprint and low resource consumption, making it ideal for handling multiple, actively changing Web pages.

That’s a tall order. According to NGINX’s principal architect Igor Sysoev, here’s how NGINX can support hundreds of millions of Facebook users.

I have to admit, NGINX being web server #2 caught my attention. Not to mention that it powers Hulu, Facebook and WordPress.com.

It has been years since I have even looked at an Apache web server (used to run them) but I do remember their stability and performance. And Apache would be my reflex recommendation for delivering web pages from a topic map application. Why re-write what already works?

Now NGINX comes along with impressive performance numbers and potentially new ways to organize on the server side.

Read the article, grab a copy of NGINX and let me know what you think.

December 17, 2011

Semantic Prediction?

Filed under: Bug Prediction,Search Behavior,Semantics,Software — Patrick Durusau @ 6:34 am

Bug Prediction at Google

I first read this post because of the claim that 50% of the code base at Google changes each month. So it says, but perhaps more on that another day.

While reading the post I ran across the following:

In order to help identify these hot spots and warn developers, we looked at bug prediction. Bug prediction uses machine-learning and statistical analysis to try to guess whether a piece of code is potentially buggy or not, usually within some confidence range. Source-based metrics that could be used for prediction are how many lines of code, how many dependencies are required and whether those dependencies are cyclic. These can work well, but these metrics are going to flag our necessarily difficult, but otherwise innocuous code, as well as our hot spots. We’re only worried about our hot spots, so how do we only find them? Well, we actually have a great, authoritative record of where code has been requiring fixes: our bug tracker and our source control commit log! The research (for example, FixCache) indicates that predicting bugs from the source history works very well, so we decided to deploy it at Google.

How it works

In the literature, Rahman et al. found that a very cheap algorithm actually performs almost as well as some very expensive bug-prediction algorithms. They found that simply ranking files by the number of times they’ve been changed with a bug-fixing commit (i.e. a commit which fixes a bug) will find the hot spots in a code base. Simple! This matches our intuition: if a file keeps requiring bug-fixes, it must be a hot spot because developers are clearly struggling with it.

So, if that is true for software bugs, doesn’t it stand to reason the same is true for semantic impedance? That is, when a user selects one result and then, within some time window, selects a different one, the reason is that the first failed to meet their criteria for a match. Same intuition. Users change because the match, in their view, failed.

Rather than trying to “reason” about the semantics of terms, we can simply observe user behavior with regard to those terms in the aggregate. And perhaps even salt the mine, as it were, with deliberate cases to test theories about the semantics of terms.

I haven’t done the experiment, yet, but it is certainly something that I will be looking into this next year. I think it has definite potential and would scale.
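For concreteness, a minimal sketch of the cheap heuristic Rahman et al. describe: rank files by how often they appear in bug-fixing commits. The Commit record and the sample history are made up for illustration (Java 16+).

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** A commit touching some files, flagged as a bug fix or not. */
record Commit(boolean bugFix, List<String> files) {}

final class HotSpotRanker {
    /** Rank files by the number of bug-fixing commits that touched them. */
    static List<Map.Entry<String, Long>> rank(List<Commit> history) {
        Map<String, Long> counts = history.stream()
                .filter(Commit::bugFix)
                .flatMap(c -> c.files().stream())
                .collect(Collectors.groupingBy(f -> f, Collectors.counting()));
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Commit> history = List.of(
                new Commit(true, List.of("parser.c", "lexer.c")),
                new Commit(false, List.of("README")),
                new Commit(true, List.of("parser.c")));
        rank(history).forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
    }
}
```

The same counting scheme would serve the semantic case: tally how often users abandon one result for another within the time window.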

December 14, 2011

Network Graph Visualizer

Filed under: Collaboration,Networks,Software,Visualization — Patrick Durusau @ 7:44 pm

Network Graph Visualizer

I ran across this at Github while tracking the progress of a project.

Although old hat (2008), I thought it worth pointing out as a graph that has one purpose: to keep developers informed of each others’ activities in a collaborative environment, and it does that very well.

I suspect there is a lesson there for topic map software (or even software in general).

December 11, 2011

The Coron System

Filed under: Associations,Data Mining,Software — Patrick Durusau @ 9:23 pm

The Coron System

From the overview:

Coron is a domain and platform independent, multi-purposed data mining toolkit, which incorporates not only a rich collection of data mining algorithms, but also allows a number of auxiliary operations. To the best of our knowledge, a data mining toolkit designed specifically for itemset extraction and association rule generation like Coron does not exist elsewhere. Coron also provides support for preparing and filtering data, and for interpreting the extracted units of knowledge.

In our case, the extracted knowledge units are mainly association rules. At the present time, finding association rules is one of the most important tasks in data mining. Association rules allow one to reveal “hidden” relationships in a dataset. Finding association rules requires first the extraction of frequent itemsets.

Currently, there exist several freely available data mining algorithms and tools. For instance, the goal of the FIMI workshops is to develop more and more efficient algorithms in three categories: (1) frequent itemsets (FI) extraction, (2) frequent closed itemsets (FCI) extraction, and (3) maximal frequent itemsets (MFI) extraction. However, they tend to overlook one thing: the motivation to look for these itemsets. After having found them, what can be done with them? Extracting FIs, FCIs, or MFIs only is not enough to generate really useful association rules. The FIMI algorithms may be very efficient, but they are not always suitable for our needs. Furthermore, these algorithms are independent, i.e. they are not grouped together in a unified software platform. We also did experiments with other toolkits, like Weka. Weka covers a wide range of machine learning tasks, but it is not really suitable for finding association rules. The reason is that it provides only one algorithm for this task, the Apriori algorithm. Apriori finds FIs only, and is not efficient for large, dense datasets.

Because of all these reasons, we decided to group the most important algorithms into a software toolkit that is aimed at data mining. We also decided to build a methodology and a platform that implements this methodology in its entirety. Another advantage of the platform is that it includes the auxiliary operations that are often missing in the implementations of single algorithms, like filtering and pre-processing the dataset, or post-processing the found association rules. Of course, the usage of the methodology and the platform is not narrowed to one kind of dataset only, i.e. they can be generalized to arbitrary datasets.
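The overview’s point that frequent itemsets are only an intermediate step can be made concrete: once itemset supports are known, candidate rules are scored by confidence, support(X ∪ Y) / support(X). A minimal Java sketch (single-item consequents only, toy data, not Coron’s API):

```java
import java.util.*;

final class RuleGenerator {
    record Rule(Set<String> antecedent, String consequent, double confidence) {}

    /** Generate rules X -> y from frequent itemsets and their support counts. */
    static List<Rule> rules(Map<Set<String>, Integer> support, double minConfidence) {
        List<Rule> result = new ArrayList<>();
        for (Set<String> itemset : support.keySet()) {
            if (itemset.size() < 2) continue;
            for (String item : itemset) {
                Set<String> antecedent = new HashSet<>(itemset);
                antecedent.remove(item);
                Integer antecedentSupport = support.get(antecedent);
                if (antecedentSupport == null) continue;
                double confidence = (double) support.get(itemset) / antecedentSupport;
                if (confidence >= minConfidence) {
                    result.add(new Rule(antecedent, item, confidence));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Set<String>, Integer> support = Map.of(
                Set.of("bread"), 60,
                Set.of("butter"), 50,
                Set.of("bread", "butter"), 40);
        rules(support, 0.6).forEach(r ->
                System.out.println(r.antecedent() + " -> " + r.consequent()
                        + " (confidence " + r.confidence() + ")"));
    }
}
```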

I found this too late in the weekend to do more than report it.

I have spent most of the weekend trying to avoid expanding a file to approximately 2 TB before parsing it. More on that saga later this week.

Anyway, Coron looks/sounds quite interesting.

Anyone using it that cares to comment on it?

November 25, 2011

SpiderDuck: Twitter’s Real-time URL Fetcher

Filed under: Software,Topic Map Software,Topic Map Systems,Tweets — Patrick Durusau @ 4:26 pm

SpiderDuck: Twitter’s Real-time URL Fetcher

A bit of a walk on the engineering side, but in order to be relevant, topic maps do have to be written and topic map software implemented.

This is a very interesting write-up of how Twitter relied mostly on open source tools to create a system that could be very relevant to topic map implementations.

For example, the fetch/no-fetch decision for URLs is based on a comparison to URLs fetched within X days. Hmmm, comparison of URLs, oh, those things that occur in subjectIdentifier and subjectLocator properties of topics. Do you smell relevance?
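A minimal sketch of that fetch/no-fetch decision: an in-memory map from URL to last fetch time, consulted against a freshness window. SpiderDuck persists and distributes this state, so treat this purely as an illustration of the comparison; the class and method names are invented.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Decide whether a URL should be fetched again, given when it was last fetched. */
final class FetchPolicy {
    private final Map<String, Instant> lastFetched = new ConcurrentHashMap<>();
    private final Duration window;

    FetchPolicy(Duration window) { this.window = window; }

    /** True if the URL has not been fetched within the window; records the fetch. */
    boolean shouldFetch(String url, Instant now) {
        Instant previous = lastFetched.get(url);
        if (previous != null && Duration.between(previous, now).compareTo(window) < 0) {
            return false;               // fetched recently enough: skip
        }
        lastFetched.put(url, now);
        return true;
    }
}
```

Something like new FetchPolicy(Duration.ofDays(7)).shouldFetch(url, Instant.now()) captures the "fetched within X days" comparison the post describes.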

And there is harvesting of information from web pages, one assumes that could be done on “information items” from a topic map as well, except there it would be properties, etc. Even more relevance.

What parts of SpiderDuck do you find most relevant to a topic map implementation?

November 16, 2011

expressor – Data Integration Platform

Filed under: Data Integration,Software — Patrick Durusau @ 8:18 pm

expressor – Data Integration Platform

I ran across expressor while reading a blog entry that uses it as the integration software for working through Facebook and Twitter data.

It has a community edition but apparently only runs on Windows (XP and Windows 7, there’s a smart move).

Before I download/install, any comments? Suggestions for other integration tasks?

Thanks!

Oh, the post that got me started on this: expressor: Enterprise Application Integration with Social Networking Applications. Realize that expressor is an ETL tool but sometimes that is what a job requires.

November 5, 2011

Go away kid, you bother me.

Filed under: Software,Solr — Patrick Durusau @ 6:42 pm

I was reminded of this W.C. Fields quote when I read the following from Formtek:

Formtek releases version 5.4.2 of the Formtek | Orion 5 SDK Pure Java API product for Linux, providing Full-Text-Search capability.

I went to the announcement (a pdf file) only to read:

Formtek releases version 5.4.2 of the Formtek | Orion 5 SDK Pure Java API™ product for Linux®, which provides support for:

  • Full-Text Indexing and Search

If you are a current customer, you can find out more by logging on at:

http://support.formtek.com/Login.asp

After logging on, click the link for Formtek Product Documentation to view the Product Release Notes for this release.

If you are not a current customer and would like more information, please contact us at sales@formtek.com.

The Formtek blog said: ECM: Formtek Announces SOLR Integration for Orion ECM, hence my interest, the integration of SOLR.

But that was all. To be contrasted with announcements from other vendors that often give specifics for everyone to read about integration of open source projects into their software offerings, even proprietary ones.

I’m not interested enough to ask for more information from Formtek. Are you?

November 4, 2011

Near-real-time readers with Lucene’s SearcherManager and NRTManager

Filed under: Indexing,Lucene,Software — Patrick Durusau @ 6:11 pm

Near-real-time readers with Lucene’s SearcherManager and NRTManager

From the post:

Last time, I described the useful SearcherManager class, coming in the next (3.5.0) Lucene release, to periodically reopen your IndexSearcher when multiple threads need to share it. This class presents a very simple acquire/release API, hiding the thread-safe complexities of opening and closing the underlying IndexReaders.

But that example used a non near-real-time (NRT) IndexReader, which has relatively high turnaround time for index changes to become visible, since you must call IndexWriter.commit first.

If you have access to the IndexWriter that’s actively changing the index (i.e., it’s in the same JVM as your searchers), use an NRT reader instead! NRT readers let you decouple durability to hardware/OS crashes from visibility of changes to a new IndexReader. How frequently you commit (for durability) and how frequently you reopen (to see new changes) become fully separate decisions. This controlled consistency model that Lucene exposes is a nice “best of both worlds” blend between the traditional immediate and eventual consistency models.

Getting into the hardcore parts of Lucene!

Understanding Lucene (or a similar indexing engine) is critical both to mining data and to delivering topic map based information to users.
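A minimal sketch of the acquire/release pattern with an NRT reader, assuming a recent Lucene release on Java 11+ (constructor and refresh method names have shifted a bit since the 3.5 API the post describes):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherFactory;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;

public class NrtSearchExample {
    public static void main(String[] args) throws Exception {
        var dir = new ByteBuffersDirectory();
        var writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

        // NRT manager built from the writer: changes become visible on refresh,
        // independently of when we commit for durability.
        var manager = new SearcherManager(writer, new SearcherFactory());

        Document doc = new Document();
        doc.add(new TextField("body", "hello topic maps", Field.Store.YES));
        writer.addDocument(doc);

        manager.maybeRefresh();                 // make the new document visible

        IndexSearcher searcher = manager.acquire();
        try {
            int hits = searcher.count(new TermQuery(new Term("body", "topic")));
            System.out.println("hits: " + hits);
        } finally {
            manager.release(searcher);          // never close an acquired searcher
        }

        writer.commit();                        // durability, on its own schedule
        manager.close();
        writer.close();
    }
}
```

The point of the pattern is the one the post makes: how often you commit (durability) and how often you refresh (visibility) are tuned independently.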

October 22, 2011

Java Wikipedia Library (JWPL)

Filed under: Data Mining,Java,Software — Patrick Durusau @ 3:16 pm

Java Wikipedia Library (JWPL)

From the post:

Lately, Wikipedia has been recognized as a promising lexical semantic resource. If Wikipedia is to be used for large-scale NLP tasks, efficient programmatic access to the knowledge therein is required.

JWPL (Java Wikipedia Library) is an open-source, Java-based application programming interface that allows access to all information contained in Wikipedia. The high-performance Wikipedia API provides structured access to information nuggets like redirects, categories, articles and link structure. It is described in our LREC 2008 paper.

JWPL contains a Mediawiki Markup parser that can be used to further analyze the contents of a Wikipedia page. The parser can also be used stand-alone with other texts using MediaWiki markup.

Further, JWPL contains the tool JWPLDataMachine that can be used to create JWPL dumps from the publicly available dumps at download.wikimedia.org.

Wikipedia is a resource of growing interest. This toolkit may prove useful in mining it for topic map purposes.

October 20, 2011

Tech Survivors: Geek Technologies…

Filed under: Language,Pattern Matching,Software — Patrick Durusau @ 6:33 pm

Tech Survivors: Geek Technologies That Still Survive 25 to 50 Years Later

Simply awesome!

Useful review for a couple of reasons:

First: New languages, formats, etc., will emerge but legacy systems “…will be with you always.” (Or at least it will feel that way, so being able to interface with legacy systems (understand their semantics) is going to be important for a very long time.)

Second: What was it about these technologies that made them succeed? (I don’t have the answer or I would be at the USPTO filing every patent and variant of patent that I could think of. 😉 It is clearly non-obvious because no one else is there either.) Equally long-lived technologies are with us today; we just don’t know which ones.

Would not hurt to put this on your calendar to review every year or so. The more you know about new technologies, the more likely you are to spot a resemblance or pattern matching one of these technologies. Maybe.

October 19, 2011

MyBioSoftware

Filed under: Bioinformatics,Biomedical,Software — Patrick Durusau @ 3:16 pm

MyBioSoftware: Bioinformatics Software Blog

From the blog:

My Biosoftware Blog supplies free bioinformatics software for biology scientists, every day.

Impressive listing of bioinformatics software. Not my area (by training). It is one in which I am interested because of the rapid development of data analysis techniques, which may be applicable more broadly.

Question/Task: Select any two software packages in a category and document the output formats that they support. Thinking it would be useful to have a chart of formats supported for each category. May uncover places where interchange isn’t easy or perhaps even possible.

October 8, 2011

Tree Traversal in O(1) Space

Filed under: Algorithms,Graphs,Software,Trees — Patrick Durusau @ 8:14 pm

Tree Traversal in O(1) Space by Sanjoy.

From the post:

I’ve been reading some texts recently, and came across a very interesting way to traverse a graph, called pointer reversal. The idea is this — instead of maintaining an explicit stack (of the places you’ve visited), you try to store the relevant information in the nodes themselves. One approach that works on directed graphs with two (outgoing) arcs per node is called the Deutsch-Schorr-Waite algorithm. This was later extended by Thorelli to work for directed graphs with an unknown number of (outgoing) arcs per node.

Implemented here for a tree; care to go for a more general graph?
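The post is about pointer reversal on directed graphs; for the tree case a closely related O(1)-space technique is Morris (threaded) in-order traversal, which temporarily repurposes null right pointers instead of reversing arcs. A sketch of that related technique, not the post's algorithm:

```java
/** Binary tree node for the traversal sketch. */
final class Node {
    int value;
    Node left, right;
    Node(int value) { this.value = value; }
}

/** In-order traversal in O(1) extra space via temporary threads (Morris traversal). */
final class MorrisTraversal {
    static void inOrder(Node root) {
        Node current = root;
        while (current != null) {
            if (current.left == null) {
                System.out.println(current.value);
                current = current.right;
            } else {
                // Find the in-order predecessor of current.
                Node pred = current.left;
                while (pred.right != null && pred.right != current) {
                    pred = pred.right;
                }
                if (pred.right == null) {
                    pred.right = current;       // create a temporary thread
                    current = current.left;
                } else {
                    pred.right = null;          // remove the thread, restore the tree
                    System.out.println(current.value);
                    current = current.right;
                }
            }
        }
    }
}
```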
