Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 25, 2011

a speed gun for spam

Filed under: Subject Identifiers,Subject Identity — Patrick Durusau @ 7:35 pm

a speed gun for spam

From the post:

Apart from the content there are various features from metadata (like IP etc) which can help tell a spammer and regular user apart. Following are results of some data analysis (done on roughly 8000+ comments) which speak of another feature which proves to be a good discriminator. Hopefully this will aid others fighting spam/abuse (if not already using a similar feature).

(graph omitted)

The discriminator referred to above is typing speed. The graph above plots the content length of a comment posted by a user against the (approximate) time he took to write it. If a user posts more than one comment in a window of 5-10 minutes, we can consider those comments as consecutive posts. …

An illustration that subject identity tests are limited only by your imagination. From what I understand, very few spammers self-identify using OWL and URLs. So, as in this case, you need other tests to separate them.

A follow-up would be to see whether particular spammers have speed patterns in their posts or, looking more broadly across a set of blogs, a particular traversal pattern. That is, they start with blog X and then move down the line. That could be useful for dynamically configuring firewalls to block further content after they hit the first blog.

You have heard that passwords + keying patterns are used for personal identity?
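For the curious, here is one way such a feature might be computed. A minimal sketch in Python: the 5-10 minute windowing heuristic follows the post, while the data layout ((timestamp, text) pairs per user) and the function itself are my own illustration.

```python
from datetime import timedelta

def typing_speeds(comments, window=timedelta(minutes=10)):
    """Approximate typing speed (characters/second) from consecutive
    comments by one user. Following the post's heuristic: if two
    comments fall within the window, treat the gap as the time spent
    writing the second one. `comments` is a time-sorted list of
    (timestamp, text) pairs for a single user."""
    speeds = []
    for (t0, _), (t1, text) in zip(comments, comments[1:]):
        gap = (t1 - t0).total_seconds()
        if 0 < gap <= window.total_seconds():
            speeds.append(len(text) / gap)
    return speeds

# A "user" producing 900 characters in 3 seconds is pasting, not typing.
```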

Topincs 5.5.1

Filed under: Topic Map Software,Topincs — Patrick Durusau @ 7:35 pm

Topincs 5.5.1

From the website:

This version adds the store command ‘import-system-map’ and the general command ‘import-all-system-map’.

SquareCog’s SquareBlog

Filed under: Pig — Patrick Durusau @ 7:35 pm

SquareCog’s SquareBlog by Dmitriy Ryaboy.

Blog devoted mostly to Pig and related technologies.

Mapreduce & Hadoop Algorithms in Academic Papers (4th update – May 2011)

Filed under: Algorithms,Hadoop,MapReduce — Patrick Durusau @ 7:34 pm

Mapreduce & Hadoop Algorithms in Academic Papers (4th update – May 2011) by Amund Tveit.

From the post:

It’s been a year since I last updated the mapreduce algorithms posting, and it has been truly an excellent year for mapreduce and hadoop – the number of commercial vendors supporting it has multiplied, e.g. with 5 announcements at EMC World only last week (Greenplum, Mellanox, Datastax, NetApp, and Snaplogic) and today’s Datameer funding announcement, which benefits the mapreduce and hadoop ecosystem as a whole (even for small fish like us here in Atbrox). The work-horse in mapreduce is the algorithm; this update has added 35 new papers compared to the prior posting (new ones are marked with *). I’ve also added 2 new categories since the last update – astronomy and social networking.

A truly awesome resource!

This promises to be hours of entertainment!

Adding bed/wig data to dalliance genome browser

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 7:34 pm

Adding bed/wig data to dalliance genome browser

From the post:

I have been playing a bit with the dalliance genome browser. It is quite useful and I have started using it to generate links to send to researchers to show regions of interest we find from bioinformatics analyses.

I added a document to my github repo describing how to display a bed file in the browser. That rst is here and displayed inline below.

It uses the UCSC binaries for creating BigWig/BigBed files because dalliance can request a subset of the data without downloading the entire file given the correct apache configuration (also described below).

This will require a recent version of dalliance because there was a bug in the BigBed parsing until recently.

Dalliance Data Tutorial

dalliance is a web-based scrolling genome-browser. It can display data from remote DAS servers or local or remote BigWig or BigBed files.

This will cover how to set up an html page that links to remote DAS services. It will also show how to create and serve BigWig and BigBed files.

Obviously of interest to the bioinformatics community (who are no doubt already aware of it) but I wanted to point out the ability to display data from remote servers/data sets.
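The “subset of the data” point deserves a note: BigWig/BigBed files are indexed binary formats, so a client like dalliance only needs the byte ranges covering the region being viewed. A minimal Python sketch of the underlying mechanism, HTTP range requests (the URL is hypothetical):

```python
import urllib.request

# Hypothetical BigBed file on a server configured to honor Range requests.
req = urllib.request.Request("http://example.org/data/regions.bb")
req.add_header("Range", "bytes=0-65535")  # ask for the first 64 KB only

with urllib.request.urlopen(req) as resp:
    chunk = resp.read()
    # 206 (Partial Content) if ranges are honored; 200 means the
    # server ignored the header and sent the whole file.
    print(resp.status, len(chunk))
```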

Dumbo

Filed under: Hadoop,MapReduce,Python — Patrick Durusau @ 7:34 pm

Dumbo

Have you seen Dumbo?

Described as:

Dumbo is a project that allows you to easily write and run Hadoop programs in Python (it’s named after Disney’s flying circus elephant, since the logo of Hadoop is an elephant and Python was named after the BBC series “Monty Python’s Flying Circus”). More generally, Dumbo can be considered to be a convenient Python API for writing MapReduce programs.
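To give a flavor of it, the classic word count fits in a dozen lines of Dumbo-style Python. A sketch; verify the exact run() signature against the Dumbo documentation:

```python
def mapper(key, value):
    # key: byte offset of the line, value: one line of input text
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # key: a word, values: the counts emitted for it by the mappers
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer, combiner=reducer)
```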

I ran across DAG jobs and mapredtest on the Dumbo blog. Seeing DAG meant I had to run the reference down so here we are. 😉


The use of DAGs (directed acyclic graphs) with text representation systems has been studied by Michael Sperberg-McQueen and Claus Huitfeldt for many years. DAGs are thought to be useful for some cases of overlapping markup.

I remain unconvinced by the DAG approach.

Research on Visualization Method and System for Patent Intelligence Knowledge (humor)

Filed under: Humor — Patrick Durusau @ 7:34 pm

Research on Visualization Method and System for Patent Intelligence Knowledge

Err, 114 pages, author identified as “Andy,” for $40.00?

The page also says: “Economics Paper, Economics Term Paper, Economics Research Paper”

Is this one of those paper mills I keep hearing about?

I included the abstract so you could be amused by the writing style.

Abstract:

Patent intelligence knowledge includes the most important information resources, which are required by economic development, technological innovation and strategic decision, is the concentrated expression of technical innovation and innovative products, and provides important decision support to compete. Analyzing patent intelligence and mining patent intelligence knowledge could be helpful for shortening research cycles, saving scientific money, and making the product strategy keep pace with the markets.At present, the domestic patent intelligence analysis mainly focuses on statistical features, and could not discovery the technology and knowledge rules in patent intelligence automatically. While information visualization, as one of the most conventional technique in data mining and knowledge discovery, provides an effective method for patent intelligence analysis and acquisition and expression of patent intelligence knowledge. And information visualization is the trend of patent intelligence analysis.In the research, we apply information visualization in patent intelligence analysis, research the method of visualization for Chinese patent intelligence analysis, and develop visualization system for patent intelligence knowledge. Practical work is as follows:Firstly, the key technology in topic maps visualization of patent intelligence knowledge is studied. In the research, hierarchical thematic map is generated with the help of text clustering results and the similarity of text. And then patent intelligence and topics are distributed by improved layout algorithm. A visual patent intelligence topic map is generated by contour construction algorithm faunally.Secondly, the research also focuses on multi-view visualization of patent intelligence knowledge. Stick to the problem that single view could not reveal the full range of patent intelligence, the approach that patent intelligence could be analyzed in multi-dimension and visualized with multi-view is proposed, and the framework of multi-view mapping is also studied in order to ensure the integrity of multi-view. In the approach, the distribution of topics is presented as topic maps; knowledge source of patent intelligence is showed as visual geographical map, and hierarchical abstract concept is expressed as IPC ontology tree.Finally, applying information visualization in patent intelligence analysis, a visualization system for Chinese patent intelligence is designed and implemented.

collocations in wikipedia, part 1

Filed under: Collocation,Natural Language Processing — Patrick Durusau @ 7:34 pm

collocations in wikipedia, part 1

From the post:

collocations are combinations of terms that occur together more frequently than you’d expect by chance.

they can include

  • proper noun phrases like ‘Darth Vader’
  • stock/colloquial phrases like ‘flora and fauna’ or ‘old as the hills’
  • common adjectives/noun pairs (notice how ‘strong coffee’ sounds ok but ‘powerful coffee’ doesn’t?)

let’s go through a couple of techniques for finding collocations taken from the exceptional nlp text “foundations of statistical natural language processing” by manning and schutze.

Looks like the start of a very interesting series on (statistical) collocation in Wikipedia, which is a serious data set for training purposes.

BTW, don’t miss the homepage. Lots of interesting resources.


Update: 18 November 2011

See also:

collocations in wikipedia, part 2

finding phrases with mutual information [collocations, part 3]

I am making a separate blog post on parts 2 and 3 but just in case you come here first…. Enjoy!
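Since part 3 is about mutual information, here is the core of that idea as a minimal Python sketch: score adjacent word pairs by pointwise mutual information, with a frequency cutoff because PMI is unreliable for rare pairs (a toy corpus stands in for Wikipedia):

```python
import math
from collections import Counter

def pmi_collocations(tokens, min_count=2):
    """Rank adjacent word pairs by PMI(x, y) = log2(p(x,y) / (p(x) p(y)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (x, y), c in bigrams.items():
        if c < min_count:
            continue  # rare pairs yield wildly inflated PMI estimates
        p_xy = c / (n - 1)
        scores[(x, y)] = math.log2(p_xy / ((unigrams[x] / n) * (unigrams[y] / n)))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

tokens = "old as the hills strong coffee strong coffee old as the hills".split()
print(pmi_collocations(tokens))
```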

Datablog

Filed under: Data Source,Visualization — Patrick Durusau @ 7:33 pm

Datablog

From the Guardian in the UK. If you don’t know the Guardian, you are missing a real treat.

The Datablog offers visualizations of facts that otherwise may be difficult to grasp or that become more compelling in graphic form.

Browse around and you will find a number of interesting resources, such as a listing of all the visualizations for the last 2 years and information on how they make data available.

Some news outlets in the U.S., such as the New York Times, have similar efforts but I don’t know of any that are quite this good. Suggestions anyone?

This and similar resources should give you ideas on how to visualize information in order to discover information and subjects for your topic maps, as well as ways to present topic map data more effectively to your users.

  1. Choose one visualization from the Guardian and explain what advantages it offers over a simple table layout of the same information. 2-3 pages (no citations)
  2. What other information sets could be effectively displayed using a technique similar to #1? What would be different about it over table display? 2-3 pages (no citations)
  3. What are the limitations of the visualization you have chosen for #2? 2-3 pages (no citations)

Humanizing Bioinformatics

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 7:33 pm

Humanizing Bioinformatics by Saaien Tist.

From the post:

I was invited last week to give a talk at this year’s meeting of the Graduate School Structure and Function of Biological Macromolecules, Bioinformatics and Modeling (SFMBBM). It ended up being a day with great talks, by some bright PhD students and postdocs. There were 2 keynotes (one by Prof Bert Poolman from Groningen (NL) and one by myself), and a panel discussion on what the future holds for people nearing the end of their PhDs.

My talk was titled “Humanizing Bioinformatics” and was received quite well (at least some people still laughed at my jokes (if you can call them that); even at the end). I put the slides up on slideshare, but I thought I’d explain things here as well, because those slides will probably not convey the complete story.

Let’s ruin the plot by mentioning it here: we need data visualization to counteract the alienation that’s happening between bioinformaticians and bright data miners on the one hand, and the user/clinician/biologist on the other. We need to make bioinformatics human again. (emphasis in original)

I just wish there had been a video recording of this presentation!

Questions:

  1. Do you agree with the issues that Saaien raises? Are there more that you would raise? 2-3 pages (no citations)
  2. Have “semantics” become what can be evaluated by a computer? Pick yes, no, undecided and cite web examples for your position. 2-3 pages
  3. How much do you trust the answers to your searches? (Classroom discussion question.)

October 24, 2011

Introduction to Spring Data Neo4j – Webinar

Filed under: Graphs,Neo4j,Spring Data — Patrick Durusau @ 6:46 pm

Introduction to Spring Data Neo4j – Webinar – 2011-11-10 10:00 PT

From the website:

The Spring Data project makes it easier to build Spring-powered applications that use new data access technologies such as non-relational NOSQL databases, cloud based data services, for instance graph databases. This webinar is designed for enterprise developers that are working with Spring and need to understand how they would integrate a NOSQL graph database.

Spring Data Neo4j is an integration library for the open source NOSQL graph database Neo4j and has been around for over a year, evolving from its infancy as brainchild of Rod Johnson and Emil Eifrem. It supports multiple, annotation based POJO to Graph Mapping strategies, a Neo4j Template API and extensive support for Spring Data Repositories. It can work with an embedded graph database or with the standalone Neo4j Server.

ONKI

Filed under: Ontology — Patrick Durusau @ 6:45 pm

ONKI (Finnish Ontology Library Service)

From the website:

The ONKI service contains Finnish and international ontologies, vocabularies and thesauri needed for publishing your content cost-efficiently on the Semantic Web. Ontologies are conceptual models identifying the concepts of a domain. They contain machine “understandable” descriptions of the relations between the concepts.

ONKI is published and maintained by the Semantic Computing Research Group SeCo. It is part of the on-going project to build a national semantic web infrastructure for Finland (FinnONTO).

Collection of ontologies/vocabularies, some of which will be familiar, others, perhaps less so.

Searchable ontologies/vocabularies and in many cases, downloadable.

The Finnish Collaborative Holistic Ontology (KOKO)

Filed under: Ontology — Patrick Durusau @ 6:44 pm

The Finnish Collaborative Holistic Ontology (KOKO)

From the website:

The Finnish Collaborative Holistic Ontology is the general, aggregated ontology of the National ontology service ONKI. KOKO ontology has the General Finnish Ontology YSO as its top ontology and a variety of other domain specific ontologies extending its concepts into more detailed subconcept hierarchies. KOKO’s domain specific ontologies include initially MAO (cultural heritage), AFO (agriforestry), TAO (applied arts), VALO (photography), and other ontologies are being added to KOKO by ontology matching.

The idea of KOKO and the National Finnish ontology infrastructure is described in English and in Finnish in the articles and reports below.

The KOKO ontology is created as a part of the FinnONTO project.

If you are looking for upper ontologies, this is one.

I ran across this while looking up references in: MUTU: An Analysis Tool….

MUTU: An Analysis Tool…

Filed under: Mapping,Ontology — Patrick Durusau @ 6:44 pm

MUTU: An Analysis Tool for Maintaining a System of Hierarchically Linked Ontologies (pdf)

Abstract

We consider ontology evolution in a system of light-weight Linked Data ontologies, aligned with each other to form a larger ontology system. When one ontology changes, the human editor must keep track of the actual changes and of the modifications needed in the related ontologies in order to keep the system consistent. This paper presents an analysis tool MUTU, by which such changes and their potential effects on other ontologies can be found. Such an analysis is useful for the ontology editors for understanding the differences between ontology versions, and for updating linked ontologies when changes occurred in other components of an ontology system.

Not available on the web, yet, but sounds interesting.

Subject Recognition: Discrete or Continuous

Filed under: Artificial Intelligence,Subject Recognition — Patrick Durusau @ 6:43 pm

While creating the entry for Fast Deep/Recurrent Nets for AGI Vision, I took particular note of the unbroken handwriting competitions. That task, for computer vision, is more difficult than “segmented” handwriting with breaks between the letters.

Are there parallels to subject recognition as performed by our computers versus ourselves?

That is, we record and use “discrete” values in computers for subject recognition.

We as human observers report “discrete” values when asked about subject recognition, but in fact we recognize subjects along a non-discrete continuum of values.

I am interested in applying techniques similar to continuous handwriting recognition to subject recognition.

Comments?

Fast Deep/Recurrent Nets for AGI Vision

Filed under: Artificial Intelligence,Neural Networks,Pattern Recognition — Patrick Durusau @ 6:43 pm

Fast Deep/Recurrent Nets for AGI Vision

Jürgen Schmidhuber at AGI-2011 delivers a deeply amusing presentation promoting neural networks, particularly deep/recurrent networks pioneered by his lab.

The jargon falls fast and furious so you probably want to visit his homepage for pointers to more information.

A wealth of information awaits! Suggestions on what looks the most promising for assisted topic map authoring welcome!

Topincs 5.5.0 Released!

Filed under: Topic Map Software,Topincs — Patrick Durusau @ 6:43 pm

Topincs 5.5.0 Released!

Robert Cerny announced today that Topincs 5.5.0 is available for downloading.

Release notes:
http://www.topincs.com/issues/Topincs_5.5.0

Use this procedure to update:
http://www.topincs.com/adminreference/update-secure

From the release notes:

Description

This release allows the creation of more understandable and more easily editable web databases. By introducing perspective and language in label rules it is possible to create contextual labels. The complexity of data entry forms can be reduced by initially hiding less important form fields.

Furthermore, the abilities to tailor content and restrict access were improved. The default behavior of displaying all associations between topics as links between the topic pages can now be deactivated by editing the topic type or the respective topic role constraints. The Topincs cache varies the representation of pages depending on user group.

I like the idea of “contextual labels.” Will have to give this release a spin to see how that works! More later.

Thanks Robert!

OCLC Developer Network

Filed under: Identification,Identifiers,Library Associations,OCLC Number — Patrick Durusau @ 6:42 pm

OCLC Developer Network

From the webpage:

The OCLC Developer Network is a community of developers collaborating to propose, discuss and test OCLC Web Services. This open source, code-sharing infrastructure improves the value of OCLC data for all users by encouraging new OCLC Web Service uses.

I thought that while I was looking at OCLC resources I might as well give a shout-out to the OCLC Developer Network: a community with an interest in identifiers and identification for the purpose of furthering access to information. Who could be more sympathetic to topic maps?

WorldCat Identities (Web Service)

Filed under: LCCN,OCLC Number — Patrick Durusau @ 6:42 pm

WorldCat Identities (Web Service)

From the webpage:

A service that provides personal, corporate and subject-based identities (writers, authors, characters, corporations, horses, ships, etc.) based on information in WorldCat.

  • Provides direct links to identity information based on LCCN or a personal name
  • Provides access to identity information using OpenURL based on lastname and OCLC Number
  • Provides search access to identity information

What you get

  • Browsable and searchable access to names in WorldCat and associated information such as
    • Works by
    • Works About

For non-librarians:

LCCN = Library of Congress Control Number.

OCLC number = OCLC Control Number.

A widget to search for LCCN or OCLC numbers would be quite handy.
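As a start on that widget, here is a minimal Python sketch of what it might wrap. The URL pattern is my assumption from published examples of the service; check it against the API documentation before relying on it.

```python
import urllib.request

def identity_url(lccn):
    """Build a WorldCat Identities URL from an LCCN.
    NOTE: the 'lccn-' path pattern is an assumption from published
    examples, not a documented guarantee."""
    return "http://worldcat.org/identities/lccn-%s/" % lccn.replace(" ", "")

def fetch_identity(lccn):
    with urllib.request.urlopen(identity_url(lccn)) as resp:
        return resp.read()

print(identity_url("n79-21164"))  # hypothetical LCCN
```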

For library purposes, I think merging on either one would be adequate. We would have to work out what to do with MARC fields that had varying data. Do we capture it with provenance? Discard it, favoring one record over the other, etc.?

For class:

  1. What would you suggest as an interface to this service? 2 pages (no citations)
  2. How would you use this interface at your library (or your local library)? 3 pages (no citations)

WorldCat Identities Network

Filed under: Associations,Identification,Identifiers — Patrick Durusau @ 6:41 pm

WorldCat Identities Network

A project of OCLC Research, the WorldCat Identities Network is described as:

The WorldCat Identity Network uses the WorldCat Identities Web Service and the WorldCat Search API to create an interactive Related Identity Network Map for each Identity in the WorldCat Identities database. The Identity Maps can be used to explore the interconnectivity between WorldCat Identities.

A WorldCat Identity can be a person, a thing (e.g., the Titanic), a fictitious character (e.g., Harry Potter), or a corporation (e.g., IBM).

I can’t claim to be a fan of jumpy network node displays but that isn’t a criticism, more a matter of personal taste. Some people find that sort of display quite useful.

The information conveyed, leaving display to one side, is quite interesting. It has just enough fuzziness (to me at any rate) to approach the experience of serendipitous discovery using more traditional library tools. I suspect that will vary from topic to topic but that was my experience with briefly using the interface.

Despite my misgivings about the interface, I will be returning to explore this service fairly often.

BTW, the service is obviously misnamed. What is being delivered is what we used to call “see also” or related references; thus WorldCat “See Also” Network would be a more accurate title.

For class:

  1. Spend at least an hour or more with the service and write a 2 page summary of what you liked/disliked about it. (no citations)
  2. What subject/relationship did you choose to follow? Discover anything you did not expect? 1 page (no citations)

October 23, 2011

Tweet Topic Explorer

Filed under: Mapping,Visualization — Patrick Durusau @ 7:22 pm

Tweet Topic Explorer by Jeff Clark.

From the post:

One problem I face on a daily basis is to decide for a given Twitter account whether I want to follow it or not. I consider many factors when making the decision such as language of their tweets, frequency, whether they interact on twitter with other people I admire, or if I have some personal or geographic connection with them. But the most critical factor for me is whether they tweet about things that match my interests. Sometimes you can get a hint about this by looking at their short one line twitter bio but the best way is usually to scan their latest tweets.

I have created a new tool to help see which topics a person tweets about most often. It also shows the other twitter users that are mentioned most frequently in their tweets. I call it the Tweet Topic Explorer. I’m using the recently described Word Cluster Diagrams to show the most frequently used words in their tweets and how they are grouped together. This example below is for my own account, @JeffClark, and shows one word cluster containing twitter, data, visualization, list, venn, and streamgraph. Another group has word, cloud, shaped, post etc. It’s a bit hard to see in this small image but there is a cluster about Toronto where I live and mentions of run, marathon, soccer. Also, there are bubbles for some of the people on Twitter I mention the most often: @flowingdata, @eagereyes, @blprnt, @moritz_stefaner, @dougpete.

This is an interesting exercise in visualization and potentially a very useful tool.

The US ZIPScribble Map

Filed under: Mapping,Maps — Patrick Durusau @ 7:22 pm

The US ZIPScribble Map

From the post:

What would happen if you were to connect all the ZIP codes in the US in ascending order? Is there a system behind the assignment of ZIP codes? Are they organized in a grid? The result is surprising and much more interesting than expected.

The idea for the ZIPScribble came from playing with Ben Fry’s excellent zipdecode. That little applet allows you to explore the ZIP codes interactively, and reveals some very interesting patterns. What it does not give you, however, is an idea of the overall structure of the ZIP space. Jeffrey Heer has reimplemented zipdecode using his prefuse toolkit, and provides a file containing ZIP codes and coordinates. So off I went on a little programming exercise to see what simply connecting the dots would do.

Not recent (2006) but an interesting exercise. Serves as encouragement to map data to see what, if any, interesting patterns result.
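The exercise is easy to reproduce. A minimal Python sketch, assuming a hypothetical zipcodes.csv with zip, latitude, longitude columns and no header row (Heer's data file would need reshaping into this form):

```python
import csv
import matplotlib.pyplot as plt

# Hypothetical input: one "zip,lat,lon" row per ZIP code, no header.
with open("zipcodes.csv") as f:
    rows = sorted(csv.reader(f), key=lambda r: r[0])  # ascending ZIP order

lats = [float(r[1]) for r in rows]
lons = [float(r[2]) for r in rows]

# One continuous polyline through every ZIP code, in order.
plt.plot(lons, lats, linewidth=0.2)
plt.axis("off")
plt.savefig("zipscribble.png", dpi=300)
```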

The Simple Way to Scrape an HTML Table: Google Docs

Filed under: Data Mining,HTML — Patrick Durusau @ 7:22 pm

The Simple Way to Scrape an HTML Table: Google Docs

From the post:

Raw data is the best data, but a lot of public data can still only be found in tables rather than as directly machine-readable files. One example is the FDIC’s List of Failed Banks. Here is a simple trick to scrape such data from a website: Use Google Docs.

OK, not a great trick but if you are in a hurry it may be a useful one.

Of course, I get the excuse from local governments that their staff can’t export data in useful formats (I get images of budget documents in PDF files, how useful is that?).
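The Google Docs trick is presumably the ImportHtml spreadsheet function; if you would rather script the same scrape, pandas does it in a few lines (a sketch; requires lxml or html5lib to be installed):

```python
import pandas as pd

# The FDIC failed-banks list, the example the post uses.
url = "http://www.fdic.gov/bank/individual/failed/banklist.html"

# read_html returns one DataFrame per <table> element on the page.
tables = pd.read_html(url)
print(tables[0].head())
```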

Notation as a Tool of Thought – Iverson – Turing Lecture

Filed under: CS Lectures,Language,Language Design — Patrick Durusau @ 7:22 pm

Notation as a Tool of Thought by Kenneth E. Iverson – 1979 Turing Award Lecture

I saw this lecture tweeted with a link to a poor photocopy of a double column printing of the lecture.

I think you will find the single column version from the ACM awards site much easier to read.

Not to mention that the ACM awards site has all the Turing as well as other award lectures for viewing.

I suspect that a CS class could be taught using only ACM award lectures as the primary material. Perhaps someone already has; I would appreciate a pointer if so.

Pilot

Filed under: Gremlin,Neo4j,OrientDB — Patrick Durusau @ 7:21 pm

Pilot

From the readme file:

Pilot is a graph database operator that allows you to perform common application-level operations on graph databases without delving into the details of their implementation or requiring knowledge of the component technologies.

Pilot aims to support graph databases conforming to the property graph model. Pilot employs technologies from the Tinkerpop stack — specifically Blueprints and Gremlin — for general access and manipulation of the underlying graph database, but also uses native graph database APIs to further optimize performance for certain operations. In addition, Pilot also handles multithreading and transaction management, while keeping all of these abstracted away from the calling application. As such, Pilot is ideally suited for use in concurrent web applications.

  • Supported graph database providers:
    • OrientDB
    • Neo4j
    • Tinkergraph (the Blueprints in-memory reference implementation)
    • (others may be added in future if there is demand)
  • Some of the functionality currently supported by Pilot include:
    • Get edges between given vertices
    • Get neighbors of a given vertex
    • Retrieving vertices corresponding to some properties (see Property Graph Model)
    • Transaction management
    • Thread synchronization for multithreaded access
    • Large commit optimization
    • Application profiling
  • Planned additions:

Graph databases aren’t a new idea. I don’t have the reference at hand but once ran across a relational database that was implemented as a hypergraph. It may be that computing power has finally gotten to the point that graph databases, or at least their capabilities, will be the common expectation.

Spring Data Neo4j 2.0.0.M1 Released

Filed under: Neo4j,Spring Data — Patrick Durusau @ 7:21 pm

Spring Data Neo4j 2.0.0.M1 Released

From the post:

We are pleased to announce that the first milestone release (2.0.0.M1) of the new Spring Data Neo4j major version 2.0 is now available!

In the last few weeks the engineers have been busy transforming the existing library under a new name to make it fit for its presentation at Spring One 2GX next week.

A major internal refactoring split the framework into several submodules, each addressing a different concern.

  • spring-data-neo4j: Neo4jTemplate for easy, copying object-graph-mapping, and Spring Data Repositories using persistence entity meta information
  • spring-data-neo4j-aspects: transparent object-graph-mapping using AspectJ
  • spring-data-neo4j-cross-store: AspectJ based cross-store-persistence between JPA and Neo4j
  • spring-data-neo4j-rest: transparent access of a remote Neo4j REST-Server

As part of the refactoring, the source repository was also renamed and re-organized. The previously separated examples and the tutorial project are now included directly in the same github project.

Learning Scala? Learn the Fundamentals First

Filed under: Scala,Tuples — Patrick Durusau @ 7:21 pm

Learning Scala? Learn the Fundamentals First by Craig Tataryn.

From the post:

A few weeks back I gave my talk at JavaOne 2011 titled “The Scala Language Tour”, if you’re at all interested you can grab the slides and examples from github.

The session was very well received, my only enemy was time! Given 1 hour, how does one give 170+ people a taste of all that’s Scala without completely starving them of details? Lots and lots and lots of dry-runs of your presentation, that’s how. I must have iterated my talk a dozen or more times. I just couldn’t bring myself to trimming any more fat. The short story is, I could have used 5-10 more minutes. A crucial set of slides had to be omitted concerning the “Tuple” in Scala.

Demonstrates the fundamental nature of tuples in Scala, with examples of where they can be found in Scala code.

How to create and search a Lucene.Net index…

Filed under: .Net,C#,Lucene — Patrick Durusau @ 7:21 pm

How to create and search a Lucene.Net index in 4 simple steps using C#, Step 1

From the post:

As mentioned in a previous blog, using Lucene.Net to create and search an index was quick and easy. Here I will show you in these 4 steps how to do it.

  • Create an index
  • Build the query
  • Perform the search
  • Display the results

Before we get started I wanted to mention that Lucene.Net was originally designed for Java. Because of this I think the creators used some classes in Lucene that already exist in the .Net framework. Therefore, we need to use the entire path to the classes and methods instead of using a directive to shorten it for us.

Useful for anyone exploring topic maps as a native MS Windows application.

Lucene Search Programming

Filed under: Lucene — Patrick Durusau @ 7:21 pm

Lucene Search Programming

Nothing startling but a good review of Lucene based searching with examples.

Recommended for .Net programmers.

HBase Coprocessors – Webinar – 4 November 2011

Filed under: HBase — Patrick Durusau @ 7:20 pm

HBase Coprocessors – Deploy shared functionality directly on the cluster. 4 November 2011, 10 AM PT, by Lars George.

From the announcement:

The newly added feature of Coprocessors within HBase allows the application designer to move functionality closer to where the data resides. While this sounds like Stored Procedures as known in the RDBMS realm, they have a different set of properties. The distributed nature of HBase adds to the complexity of their implementation, but the client side API allows for an easy, transparent access to their functionality across many servers. This session explains the concepts behind coprocessors and uses examples to show how they can be used to implement data side extensions to the application code.

For background material, you probably want to review:

Advanced HBase by Lars George (courtesy of Alex Popescu’s myNoSQL site). It takes until slide 72 or so to reach coprocessors but you will learn a lot of stuff along the way.

Extending Query support via Coprocessor endpoints, which summarizes the uses of coprocessors as:

Coprocessors can be used for

a) observing server side operations (like the administrative kinds such as Region splits, major-minor compactions , etc) , and

b) client side operations that are eventually triggered on to the Region servers (like CRUD operations).

Another use case is letting the end user deploy his own code (some user defined functionality) and directly invoking it from the client interface (HTable). The latter functionality is called Coprocessor Endpoints. [I introduced some paragraphing to make this more readable.]

If you have a copy of HBase: The Definitive Guide, review pages 175-199.
