Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

June 4, 2013

Textual Processing of Legal Cases

Filed under: Annotation,Law - Sources,Text Mining — Patrick Durusau @ 2:05 pm

Textual Processing of Legal Cases by Adam Wyner.

A presentation on Adam’s Crowdsourced Legal Case Annotation project.

Very useful if you are interested in guidance on legal case annotation.

Of course, I see the UI as backed by topics for its identifications, with associations between those topics.

But none of that has to be exposed to the user.

GSI2013 – Geometric Science of Information

Filed under: Conferences,Geometry,Information Geometry — Patrick Durusau @ 1:12 pm

GSI2013 – Geometric Science of Information 28-08-2013 – 30-08-2013 (Paris) (program detail)

Abstracts.

From the homepage:

The objective of this SEE Conference hosted by MINES ParisTech, is to bring together pure/applied mathematicians and engineers, with common interest for Geometric tools and their applications for Information analysis, with active participation of young researchers for deliberating emerging areas of collaborative research on “Information Geometry Manifolds and Their Advanced Applications”.

I first saw this at: Geometric Science of Information (GSI): programme is out!

NLP Weather: High Pressure or Low?

Filed under: Natural Language Processing,Translation — Patrick Durusau @ 12:43 pm

Machine Translation Without the Translation by Geoffrey Pullum.

From the post:

I have been ruminating this month on why natural language processing (NLP) still hasn’t arrived, and I have pointed to three developments elsewhere that seem to be discouraging its development. First, enhanced keyword search via Google’s influentiality-ranking of results. Second, the dramatic enhancement in applicability of speech recognition that dialog design facilitates. I now turn to a third, which has to do with the sheer power of number-crunching.

Machine translation is the unclimbed Everest of computational linguistics. It calls for syntactic and semantic analysis of the source language, mapping source-language meanings to target-language meanings, and generating acceptable output from the latter. If computational linguists could do all those things, they could hang up the “mission accomplished” banner.

What has emerged instead, courtesy of Google Translate, is something utterly different: pseudotranslation without analysis of grammar or meaning, developed by people who do not know (or need to know) either source or target language.

The trick: huge quantities of parallel texts combined with massive amounts of rapid statistical computation. The catch: low quality, and output inevitably peppered with howlers.

Of course, if I may purloin Dr Johnson’s remark about a dog walking on his hind legs, although it is not done well you are surprised to find it done at all. For Google Translate’s pseudotranslation is based on zero linguistic understanding. Not even word meanings are looked up: The program couldn’t care less about the meaning of anything. Here, roughly, is how it works.

(…)

My conjecture is that it is useful enough to constitute one more reason for not investing much in trying to get real NLP industrially developed and deployed.

NLP will come, I think; but when you take into account the ready availability of (1) Google search, and (2) speech-driven applications aided by dialog design, and (3) the statistical pseudotranslation briefly discussed above, the cumulative effect is enough to reduce the pressure to develop NLP, and will probably delay its arrival for another decade or so.

I am surprised to find that Geoffrey thinks more pressure would produce “real NLP,” which on his account is merely delayed by a decade or so for the reasons outlined in his post.

If you recall, machine translation of texts was the hot topic of the late 1950s and early 1960s.

The emphasis was on automatic translation of Russian. It was the height of the Cold War, so there was plenty of pressure for a solution.

Lots of pressure then did not result in a solution.

There’s a rather practical reason for not investing in “real NLP.”

There is no evidence that how humans “understand” language is known well enough to program a computer to mimic that “understanding.”

If Geoffrey has evidence to the contrary, I am sure everyone would be glad to hear about it.

Speech Recognition vs. Language Processing [noise-burst classification]

Filed under: Linguistics,Natural Language Processing — Patrick Durusau @ 12:06 pm

Speech Recognition vs. Language Processing by Geoffrey Pullum.

From the post:

I have stressed that we are still waiting for natural language processing (NLP). One thing that might lead you to believe otherwise is that some companies run systems that enable you to hold a conversation with a machine. But that doesn’t involve NLP, i.e. syntactic and semantic analysis of sentences. It involves automatic speech recognition (ASR), which is very different.

ASR systems deal with words and phrases rather as the song “Rawhide” recommends for cattle: “Don’t try to understand ’em; just rope and throw and brand ’em.”

Labeling noise bursts is the goal, not linguistically based understanding.

(…)

Prompting a bank customer with “Do you want to pay a bill or transfer funds between accounts?” considerably improves the chances of getting something with either “pay a bill” or “transfer funds” in it; and they sound very different.

In the latter case, no use is made by the system of the verb + object structure of the two phrases. Only the fact that the customer appears to have uttered one of them rather than the other is significant. What’s relevant about pay is not that it means “pay” but that it doesn’t sound like tran-. As I said, this isn’t about language processing; it’s about noise-burst classification.

I can see why the NLP engineers dislike Pullum so intensely.

Characterizing “speech recognition” as “noise-burst classification,” while entirely accurate, is also offensive.

😉

“Speech recognition” fools a layperson into thinking NLP is more sophisticated than it is in fact.

The question for NLP engineers is: Why the pretense at sophistication?

Understanding matrices intuitively, part 1

Filed under: Mathematics,Matrix — Patrick Durusau @ 10:27 am

Understanding matrices intuitively, part 1 by William Gould.

From the post:

I want to show you a way of picturing and thinking about matrices. The topic for today is the square matrix, which we will call A. I’m going to show you a way of graphing square matrices, although we will have to limit ourselves to the 2 x 2 case. That will be, as they say, without loss of generality. The technique I’m about to show you could be used with 3 x 3 matrices if you had a better 3-dimensional monitor, and as will be revealed, it could be used on 3 x 2 and 2 x 3 matrices, too. If you had more imagination, we could use the technique on 4 x 4, 5 x 5, and even higher-dimensional matrices.

Matrices are quite common in information retrieval texts.

William’s post is an uncommonly good explanation of how to think about and picture matrices.
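
If you want to see the idea in action rather than just read about it, here is a quick sketch of my own (not code from William’s post) that pictures a 2 x 2 matrix by what it does to the unit square:

import numpy as np
import matplotlib.pyplot as plt

# A 2 x 2 matrix, viewed as a transformation of the plane.
A = np.array([[1.0, 0.5],
              [0.2, 1.5]])

# Corners of the unit square, one column per point.
square = np.array([[0, 1, 1, 0, 0],
                   [0, 0, 1, 1, 0]])

transformed = A @ square  # where A sends each corner

plt.plot(square[0], square[1], label="unit square")
plt.plot(transformed[0], transformed[1], label="A applied to the square")
plt.axis("equal")
plt.legend()
plt.title("Picturing a 2 x 2 matrix by what it does to the unit square")
plt.show()

Run it, change A, and watch the square stretch and shear. That is the intuition William is building toward.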

I first saw this in Christophe Lalanne’s A bag of tweets / May 2013.

Excision [Forgetting But Remembering You Forgot (Datomic)]

Filed under: Datomic,Merging — Patrick Durusau @ 10:13 am

Excision

From the post:

It is a key value proposition of Datomic that you can tell not only what you know, but how you came to know it. When you add a fact:

conn.transact(list(":db/add", 42, ":firstName", "John"));

Datomic does more than merely record that 42‘s first name is “John“. Each datom is also associated with a transaction entity, which records the moment (:db/txInstant) the datom was recorded.

(…)

Given this information model, it is easy to see that Datomic can support queries that tell you:

  • what you know now
  • what you knew at some point in the past
  • how and when you came to know any particular datom

So far so good, but there is a fly in the ointment. In certain situations you may be forced to excise data, pulling it out root and branch and forgetting that you ever knew it. This may happen if you store data that must comply with privacy or IP laws, or you may have a regulatory requirement to keep records for seven years and then “shred” them. For these scenarios, Datomic provides excision.

One approach to the unanswered question of what it means to delete something from a topic map.

Especially interesting because you can play with the answer that Datomic provides.

Doesn’t address the issue of what it means to delete a topic that has caused other topics to merge.
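
To get a feel for the model apart from Datomic itself, here is a toy sketch in Python (my own, nothing to do with Datomic’s actual API) of an append-only fact log where excision pulls facts out root and branch but leaves a record that an excision happened:

from datetime import datetime, timezone

log = []  # append-only list of (entity, attribute, value, tx-instant) datoms

def add(entity, attribute, value):
    """Record a fact along with the moment it was recorded."""
    log.append((entity, attribute, value, datetime.now(timezone.utc)))

def excise(entity):
    """Forget every fact about an entity, but remember that we forgot."""
    removed = [d for d in log if d[0] == entity]
    log[:] = [d for d in log if d[0] != entity]
    # The excision itself is recorded, without the excised values.
    add("excision-audit", ":excised/entity", entity)
    return len(removed)

add(42, ":firstName", "John")
add(42, ":lastName", "Doe")
excise(42)
print(log)  # only the audit datom remains; "John" is gone root and branch

That last audit datom is the “remembering you forgot” part: the values are gone, but the act of forgetting is itself a fact.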

I first saw this in Christophe Lalanne’s A bag of tweets / May 2013.

June 3, 2013

Putting All Species In A Graph Database [Thursday, June 6th, 13:00 EDT]

Filed under: Graphs,Neo4j — Patrick Durusau @ 3:26 pm

Putting All Species In A Graph Database

From the post:

Stephen Smith, an ecology and evolutionary biology professor at the University of Michigan, is going to explain how Neo4j and other digital technologies are assisting in constructing the tree of life. Starting at 10:00 PDT (19:00 CEST), he will also discuss other aspects of the interface of biology with next generation technologies.

“Our project is building the tools with which scientists in the community can continually improve the tree of life as we gather new information. Neo4j allows us to not only store trees in their native graph form, but also allows us to map trees to the same structure, the graph. So in fact, we are facilitating the construction of the graph of life,” says Smith.

Neo4j approached the Open Tree of Life team to present a webinar because it is a project that utilizes the Neo4j graph database to represent the interconnectedness of biological data. The company considers the project a great example of how a graph database can better model the natural world.

The online lecture is intended for a broad audience including beginner computer programmers, advanced hackers, data scientists, natural scientists, and anyone interested in the cross-section of science and technology, especially data modeling. Over 150 people have already registered online.

The registration form: [Registration]

This should be fun.

Get your questions ready!

My questions:

What happens when our understanding of the tree of life changes?

How do we preserve the “old” and “new” understandings of the tree of life?

Can we compare those understandings?
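
A rough way to prototype answers to those questions with any property graph (a sketch of my own in networkx, not the Open Tree of Life code) is to tag every parent/child edge with the study that asserted it, so competing understandings can sit in the same graph and be compared:

import networkx as nx

g = nx.MultiDiGraph()  # multiple edges allowed between the same pair of taxa

# Two hypothetical studies that disagree about where "TaxonC" belongs.
g.add_edge("Root", "TaxonA", study="smith-2010")
g.add_edge("Root", "TaxonB", study="jones-2013")
g.add_edge("TaxonA", "TaxonC", study="smith-2010")
g.add_edge("TaxonB", "TaxonC", study="jones-2013")

def parents(taxon, study):
    """The parent(s) of a taxon according to one study's tree."""
    return [u for u, v, d in g.in_edges(taxon, data=True) if d["study"] == study]

print(parents("TaxonC", "smith-2010"))  # ['TaxonA']  -- the "old" understanding
print(parents("TaxonC", "jones-2013"))  # ['TaxonB']  -- the "new" understanding

Nothing is overwritten when understanding changes; a new study just adds edges, and the old ones remain available for comparison.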

Searching With Hierarchical Fields Using Solr

Filed under: Searching,Solr — Patrick Durusau @ 3:05 pm

Searching With Hierarchical Fields Using Solr by John Berryman.

From the post:

In our recent and continuing effort to make the world a better place, we have been working with the illustrious Waldo Jaquith on a project called StateDecoded. Basically, we’re making laws easily searchable and accessible by the layperson. Here, check out the predecessor project, Virginia Decoded. StateDecoded will be similar to the predecessor but with extended and matured search capabilities. And instead of just Virginia state law, with StateDecoded, any municipality will be able to download the open source project index their own laws and give their citizens better visibility in to the rules that govern them.

For this post, though, I want to focus upon one of the good Solr riddlers that we encountered related to the hierarchical nature of the documents being indexed. Laws are divided into sections, chapters, and paragraphs and we have documents at every level. In our Solr, this hierarchy is captured in a field labeled “section”. So for instance, here are 3 examples of this section field:

  • <field name="section">30</field> – A document that contains information specific to section 30.
  • <field name="section">30.4</field> – A document that contains information specific to section 30 chapter 4.
  • <field name="section">30.4.15</field> – A document that contains information specific to section 30 chapter 4 paragraph 15.

And our goal for this field is that if anyone searches for a particular section of law, that they will be given the law most specific to their request followed by the laws that are less specific. For instance, if a user searches for “30.4”, then the results should contain the documents for section 30, section 30.4, section 30.4.15, section 30.4.16, etc., and the first result should be for 30.4. Other documents such as 40.4 should not be returned.

(…)

Excellent riddler!

I suspect the same issue comes up in other contexts as well.
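
One way to attack that kind of riddle (my guess at an approach, not necessarily the solution John describes) is to index every ancestor of a section in a multivalued field and boost the exact section at query time. A rough sketch, with hypothetical field names:

def ancestors(section):
    """Expand "30.4.15" into ["30", "30.4", "30.4.15"]."""
    parts = section.split(".")
    return [".".join(parts[:i + 1]) for i in range(len(parts))]

def section_query(q):
    """Build a query that returns q itself (boosted), everything under q,
    and q's own ancestors. Field names are hypothetical."""
    clauses = [f'section_exact:"{q}"^10',     # the most specific match first
               f'section_ancestors:"{q}"']    # documents at or below q
    clauses += [f'section_exact:"{a}"' for a in ancestors(q)[:-1]]  # above q
    return " OR ".join(clauses)

# Index time: store ancestors("30.4.15") in the multivalued section_ancestors field.
print(ancestors("30.4.15"))   # ['30', '30.4', '30.4.15']
# Query time, for a user search of "30.4":
print(section_query("30.4"))

Relevancy ordering among the less specific matches still needs tuning, but the shape of the problem is the same wherever hierarchical identifiers show up.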

Content-Negotiation for WorldCat

Filed under: Linked Data,WorldCat — Patrick Durusau @ 2:44 pm

Content-Negotiation for WorldCat by Richard Wallis.

From the post:

I am pleased to share with you a small but significant step on the Linked Data journey for WorldCat and the exposure of data from OCLC.

Content-negotiation has been implemented for the publication of Linked Data for WorldCat resources.

For those immersed in the publication and consumption of Linked Data, there is little more to say. However I suspect there are a significant number of folks reading this who are wondering what the heck I am going on about. It is a little bit techie but I will try to keep it as simple as possible.

Back last year, a linked data representation of each (of the 290+ million) WorldCat resources was imbedded in it’s web page on the WorldCat site. For full details check out that announcement but in summary:

  • All resource pages include Linked Data
  • Human visible under a Linked Data tab at the bottom of the page
  • Embedded as RDFa within the page html
  • Described using the Schema.org vocabulary
  • Released under an ODC-BY open data license

That is all still valid – so what’s new from now?

That same data is now available in several machine readable RDF serialisations. RDF is RDF, but dependant on your use it is easier to consume as RDFa, or XML, or JSON, or Turtle, or as triples.

In many Linked Data presentations, including some of mine, you will hear the line “As I clicked on the link a web browser we are seeing a html representation. However if I was a machine I would be getting XML or another format back.” This is the mechanism in the http protocol that makes that happen.

I use WorldCat often. It enables readers to search for a book at their local library or to order online.
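
If you want to see the mechanism for yourself, a few lines of Python make the point. The OCLC number below is just a placeholder (substitute one from a record you care about), and whether a given media type is honored is for the server to say:

import requests

# Placeholder OCLC number; substitute one from a WorldCat record you use.
url = "http://www.worldcat.org/oclc/41266045"

for accept in ("text/turtle", "application/rdf+xml", "application/ld+json"):
    response = requests.get(url, headers={"Accept": accept})
    print(accept, "->", response.status_code,
          response.headers.get("Content-Type"))

# Same URL, different Accept header, different serialisation of the same data.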

(Re)imagining the Future of Work

Filed under: Crowd Sourcing,Interface Research/Design — Patrick Durusau @ 2:19 pm

(Re)imagining the Future of Work by Tatiana.

From the post:

Here at CrowdFlower, our Product and Engineering teams are a few months into an ambitious project: building everything we’ve learned about crowdsourcing in the past five years as industry leaders into a new, powerful and intuitive platform.

Today, we’re excited to kick off a monthly blog series that gives you an insider pass to our development process.

Here, we’ll cover the platform puzzles CrowdFlower wrestles with everyday:

  • How do we process 4 million human judgments per day with a relatively small engineering team?
  • Which UX will move crowdsourcing from the hands of early adopters into the hands of every business that requires repetitive, online work?
  • What does talent management mean in an online crowd of millions?
  • Can we become an ecosystem for developers who want to build crowdsourcing apps and tools for profit?
  • Most of all: what’s it like to rebuild a platform that carries enormous load… a sort of pit-crewing of the car while it’s hurtling around the track, or multi-organ transplant.

Our first post next week will dive into one of our recent projects: the total rewrite of our worker interface. It’s common lore that engaging in a large code-rewrite project is risky at best, and a company-killer at worst. We’ll tell you how we made it through with only a few minor scrapes and bruises, and many happier workers.

Questions:

How is a crowd different from the people who work for your enterprise?

If you wanted to capture the institutional knowledge of your staff, would the interface look like a crowd-source UI?

Should capturing institutional knowledge be broken into small tasks?

Important lessons for interfaces may emerge from this series!

The Banality of ‘Don’t Be Evil’

Filed under: Government,Government Data,Privacy — Patrick Durusau @ 2:00 pm

The Banality of ‘Don’t Be Evil’ by Julian Assange.

From the post:

“THE New Digital Age” is a startlingly clear and provocative blueprint for technocratic imperialism, from two of its leading witch doctors, Eric Schmidt and Jared Cohen, who construct a new idiom for United States global power in the 21st century. This idiom reflects the ever closer union between the State Department and Silicon Valley, as personified by Mr. Schmidt, the executive chairman of Google, and Mr. Cohen, a former adviser to Condoleezza Rice and Hillary Clinton who is now director of Google Ideas.

The authors met in occupied Baghdad in 2009, when the book was conceived. Strolling among the ruins, the two became excited that consumer technology was transforming a society flattened by United States military occupation. They decided the tech industry could be a powerful agent of American foreign policy.

The book proselytizes the role of technology in reshaping the world’s people and nations into likenesses of the world’s dominant superpower, whether they want to be reshaped or not. The prose is terse, the argument confident and the wisdom — banal. But this isn’t a book designed to be read. It is a major declaration designed to foster alliances.

“The New Digital Age” is, beyond anything else, an attempt by Google to position itself as America’s geopolitical visionary — the one company that can answer the question “Where should America go?” It is not surprising that a respectable cast of the world’s most famous warmongers has been trotted out to give its stamp of approval to this enticement to Western soft power. The acknowledgments give pride of place to Henry Kissinger, who along with Tony Blair and the former C.I.A. director Michael Hayden provided advance praise for the book.

In the book the authors happily take up the white geek’s burden. A liberal sprinkling of convenient, hypothetical dark-skinned worthies appear: Congolese fisherwomen, graphic designers in Botswana, anticorruption activists in San Salvador and illiterate Masai cattle herders in the Serengeti are all obediently summoned to demonstrate the progressive properties of Google phones jacked into the informational supply chain of the Western empire.

(…)

I am less concerned with privacy and more concerned with the impact of technological imperialism.

I see no good coming from the infliction of Western TV and movies on other cultures.

Or in making local farmers part of the global agriculture market.

Or infecting Iraq with sterile wheat seeds.

Compared to those results, privacy is a luxury of the bourgeois who worry about such issues.

I first saw this at Chris Blattman’s Links I liked.

The 3 Vs of Big Data revisited: Venn diagrams and visualization

Filed under: Graphics,Venn Diagrams,Visualization — Patrick Durusau @ 1:21 pm

The 3 Vs of Big Data revisited: Venn diagrams and visualization by Vincent Granville.

From the post:

This discussion is about visualization. The three Vs of big data (volume, velocity, variety) or the three skills that make a data scientist (hacking, statistics, domain expertise) are typically visualized using a Venn diagram, representing all the potential 8 combinations through set intersections. In the case of big data, I believe (visualization, veracity, value) are more important than (volume, velocity, variety), but that’s another issue. Except that one of my Vs is visualization and all these Venn diagrams are visually wrong: the color at the intersection of two sets should be the blending of both colors of the parent sets, for easy interpretation and easy generalization to 4 or more sets. For instance, if we have three sets A, B, C painted respectively in red, green, blue, the intersection of A and B should be yellow, the intersection of the three should be white.

Sorry to disappoint fans of the “3 Vs of Big Data,” but as Vincent points out, there are at least six (6). (Probably more. Post your suggestions.)

It is a helpful review on Venn diagrams until Vincent says:

For most people, the brain has a hard time quickly processing more than 4 dimensions at once, and this should be kept in mind when producing visualizations. Beyond 5 dimensions, any additional dimension probably makes your visual less and less useful for value extraction, unless you are a real artist!

I don’t think four dimensions is going to be easy:

[Animation: 3D projection of a tesseract undergoing a simple rotation in four dimensional space.]

“Why don’t libraries get better the more they are used?” [Librarians Do]

Filed under: Librarian/Expert Searchers,Library,Topic Maps — Patrick Durusau @ 12:59 pm

“Why don’t libraries get better the more they are used?”

From the post:

On June 19-20, 2013, the 8th Handheld Librarian Online Conference will take place, an online conference about encouraging innovation inside libraries.

Register now, as an individual, group or site, and receive access to all interactive, live online events and recordings of the sessions!

(…)

The keynote presentation is delivered by Michael Edson, Smithsonian Institution’s Director of Web and New Media Strategy, and is entitled “Faking the Internet”. His central question:

“Why don’t libraries get better the more they are used? Not just a little better—exponentially better, like the Internet. They could, and, in a society facing colossal challenges, they must, but we won’t get there without confronting a few taboos about what a library is, who it’s for, and who’s in charge.”

I will register for this conference.

Mostly to hear Michael Edson’s claim that the Internet has gotten “exponentially better.”

In my experience (yours?), the Internet has gotten exponentially noisier.

If you don’t believe me, write down a question (not the query) and give it to ten (10) random people outside your IT department or library.

Have them print out the first page of search results.

Enough proof?

Edson’s point that information resources should improve with use, on the other hand, is a good one.

For example, contrast your local librarian with a digital resource.

The more questions your librarian fields, the better they become at connecting related information and resources on any subject.

A digital resource, by contrast, returns the same result no matter how many times it is queried.

A librarian is a dynamic accumulator of information and relationships between information. A digital resource is a static reporter of information.

Unlike librarians, digital resources aren’t designed to accumulate new information or relationships between information from users at the point of interest. (A blog response several screen scrolls away is unseen and unhelpful.)

What we need are UIs for digital resources that enable users to map into those digital resources their insights, relationships and links to other resources.

In their own words.

That type of digital resource could become “exponentially better.”

Coming soon: new, expanded edition of “Learning SPARQL”

Filed under: RDF,Semantic Web,SPARQL — Patrick Durusau @ 9:49 am

Coming soon: new, expanded edition of “Learning SPARQL” by Bob DuCharme.

From the post:

55% more pages! 23% fewer mentions of the semantic web!

I’m very pleased to announce that O’Reilly will make the second, expanded edition of my book Learning SPARQL available sometime in late June or early July. The early release “raw and unedited” version should be available this week.

I wonder if Bob is going to start an advertising trend with “fewer mentions of the semantic web?”

😉

Looking forward to the update!

Not that I care about SPARQL all that much but I’ll learn something.

Besides, I have liked Bob’s writing style from back in the SGML days.

How I became a password cracker [Cure for Travel Boredom]

Filed under: Cybersecurity,Security — Patrick Durusau @ 9:13 am

How I became a password cracker by Nate Anderson.

From the post:

At the beginning of a sunny Monday morning earlier this month, I had never cracked a password. By the end of the day, I had cracked 8,000. Even though I knew password cracking was easy, I didn’t know it was ridiculously easy—well, ridiculously easy once I overcame the urge to bash my laptop with a sledgehammer and finally figured out what I was doing.

My journey into the Dark-ish Side began during a chat with our security editor, Dan Goodin, who remarked in an offhand fashion that cracking passwords was approaching entry-level “script kiddie stuff.” This got me thinking, because—though I understand password cracking conceptually—I can’t hack my way out of the proverbial paper bag. I’m the very definition of a “script kiddie,” someone who needs the simplified and automated tools created by others to mount attacks that he couldn’t manage if left to his own devices. Sure, in a moment of poor decision-making in college, I once logged into port 25 of our school’s unguarded e-mail server and faked a prank message to another student—but that was the extent of my black hat activities. If cracking passwords were truly a script kiddie activity, I was perfectly placed to test that assertion.

It sounded like an interesting challenge. Could I, using only free tools and the resources of the Internet, successfully:

  1. Find a set of passwords to crack
  2. Find a password cracker
  3. Find a set of high-quality wordlists and
  4. Get them all running on commodity laptop hardware in order to
  5. Successfully crack at least one password
  6. In less than a day of work?

I could. And I walked away from the experiment with a visceral sense of password fragility. Watching your own password fall in less than a second is the sort of online security lesson everyone should learn at least once—and it provides a free education in how to build a better password.

Have Nate’s post on a USB stick, along with some data and tools, during summer travel.

If the kids get bored, put them to cracking passwords.

Think of it as the 21st century version of counting license plates on other cars.
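
The heart of a wordlist attack on unsalted hashes really is script-kiddie simple. A toy illustration in Python (not the dedicated crackers and giant wordlists Nate actually used):

import hashlib

# Toy "leaked" password hashes (unsalted MD5, still sadly common).
hashes = {
    "5f4dcc3b5aa765d61d8327deb882cf99",  # md5("password")
    "e10adc3949ba59abbe56e057f20f883e",  # md5("123456")
}

wordlist = ["letmein", "password", "123456", "correcthorsebatterystaple"]

for candidate in wordlist:
    digest = hashlib.md5(candidate.encode("utf-8")).hexdigest()
    if digest in hashes:
        print(f"cracked: {digest} -> {candidate}")

Real crackers add mangling rules, GPUs and billions of candidates, but the loop is the same.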

I first saw this in a tweet by Mitch Kapor.

CIA, Solicitation and Government Transparency

Filed under: Government,Government Data,Transparency — Patrick Durusau @ 8:26 am

IBM battles Amazon over $600M CIA cloud deal by Frank Konkel reports that IBM has protested the CIA’s award of a cloud computing contract to Amazon.

The “new age” of government transparency looks a lot like the old age in that:

  • How Amazon obtained the award is not public.
  • The nature of the cloud to be built by Amazon is not public.
  • Whether Amazon has started construction on the proposed cloud is not public.
  • The basis for the protest by IBM is not public.

“Not public” means opportunities for incompetence in contract drafting and/or fraud by contractors.

How are members of the public or less well-heeled potential bidders supposed to participate in this discussion?

Or should I say “meaningfully participate” in the discussion over the cloud computing award to Amazon?

And what if others know the terms of the contract? CIA CTO Gus Hunt is reported as saying:

It is very nearly within our grasp to be able to compute on all human generated information,

If the proposed system is supposed to “compute on all human generated information,” so what?

How does knowing that aid any alleged enemies of the United States?

Other than the comfort that the U.S. makes bad technology decisions?

Keeping the content of such a system secret might disadvantage enemies of the U.S.

Keeping the contract for such a system secret disadvantages the public and other contractors.

Yes?

June 2, 2013

Hadoop REST API – WebHDFS

Filed under: Hadoop,HDFS — Patrick Durusau @ 6:57 pm

Hadoop REST API – WebHDFS by Istvan Szegedi.

From the post:

Hadoop provides a Java native API to support file system operations such as create, rename or delete files and directories, open, read or write files, set permissions, etc. A very basic example can be found on Apache wiki about how to read and write files from Hadoop.

This is great for applications running within the Hadoop cluster but there may be use cases where an external application needs to manipulate HDFS like it needs to create directories and write files to that directory or read the content of a file stored on HDFS. Hortonworks developed an additional API to support these requirements based on standard REST functionalities.

Something to add to your Hadoop toolbelt.
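
To give a flavor of the API from Python, here is a minimal sketch. Host, port and user are placeholders for your own cluster; see Istvan’s post and the WebHDFS documentation for the full set of operations:

import requests

# Placeholders -- point these at your own NameNode and user.
NAMENODE = "http://namenode.example.com:50070"
USER = "hdfs"

def webhdfs(path, op, method="get", **params):
    """Minimal WebHDFS call: /webhdfs/v1/<path>?op=<OP>&user.name=<user>."""
    url = f"{NAMENODE}/webhdfs/v1{path}"
    params.update({"op": op, "user.name": USER})
    return requests.request(method, url, params=params, allow_redirects=True)

# List a directory, make a new one, then read a file.
print(webhdfs("/tmp", "LISTSTATUS").json())
print(webhdfs("/tmp/demo", "MKDIRS", method="put").json())
print(webhdfs("/tmp/demo/README.txt", "OPEN").text)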

Take DMX-h ETL Pre-Release for a Test Drive!

Filed under: ETL,Hadoop,Integration — Patrick Durusau @ 6:49 pm

Take DMX-h ETL Pre-Release for a Test Drive! by Keith Kohl.

From the post:

Last Monday, we announced two new DMX-h Hadoop products, DMX-h Sort Edition and DMX-h ETL Edition. Several Blog posts last week included why I thought the announcement was cool and also some Hadoop benchmarks on both TeraSort and also running ETL.

Part of our announcement was the DMX-h ETL Pre-Release Test Drive. The test drive is a trial download of our DMX-h ETL software. We have installed our software on our partner Cloudera’s VM (VMware) image complete with the user case accelerators, sample data, documentation and even videos. While the download is a little large ─ ok it’s over 3GB─ it’s a complete VM with Linux and Cloudera’s CDH 4.2 Hadoop release (the DMX-h footprint is a mere 165MB!).

Cool!

Then Keith asks later in the post:

The test drive is not your normal download. This is actually a pre-release of our DMX-h ETL product offering. While we have announced our product, it is not generally available (GA) yet…scheduled for end of June. We are offering a download of a product that isn’t even available yet…how many vendors do that?!

Err, lots of them? It’s called a beta/candidate/etc. release?

😉

Marketing quibbles aside, it does sound quite interesting.

In some ways I would like to see the VM release model become more common.

Test driving software should not be an install/configuration learning experience.

That should come after users are interested in the software.

BTW, interesting approach, at least reading the webpages/documentation.

Doesn’t generate code for conversion/ETL so there is no code to maintain. Written against the DMX-h engine.

Need to think about what that means in terms of documenting semantics.

Or reconciling different ETL approaches in the same enterprise.

More to follow.

solrconfig.xml: …

Filed under: Search Engines,Searching,Solr — Patrick Durusau @ 11:58 am

solrconfig.xml: Understanding SearchComponents, RequestHandlers and Spellcheckers by Mark Bennett.

I spend most of my configuration time in Solr’s schema.xml, but the solrconfig.xml is also a really powerful tool. I wanted to use my recent spellcheck configuration experience to review some aspects of this important file. Sure, solrconfig.xml lets you configure a bunch of mundane sounding things like caching policies and library load paths, but it also has some high-tech configuration “LEGO blocks” that you can mix and match and re-assemble into all kinds of interesting Solr setups.

What is spell checking if it isn’t validation of a name? 😉

If you like knowing the details, this is a great post!
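
From the client side, what those components buy you looks roughly like this (a sketch; the /spell handler name and core name are whatever your solrconfig.xml and setup define):

import requests

# Assumed core name and spellcheck-enabled request handler from solrconfig.xml.
SOLR_SPELL = "http://localhost:8983/solr/mycore/spell"

params = {
    "q": "mispeled querry",
    "spellcheck": "true",          # enable the spellcheck component
    "spellcheck.collate": "true",  # ask for a corrected version of the whole query
    "wt": "json",
}

response = requests.get(SOLR_SPELL, params=params).json()
print(response.get("spellcheck", {}).get("suggestions"))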

MapReduce with Python and mrjob on Amazon EMR

Filed under: Amazon EMR,MapReduce,Natural Language Processing,Python — Patrick Durusau @ 10:59 am

MapReduce with Python and mrjob on Amazon EMR by Sujit Pal.

From the post:

I’ve been doing the Introduction to Data Science course on Coursera, and one of the assignments involved writing and running some Pig scripts on Amazon Elastic Map Reduce (EMR). I’ve used EMR in the past, but have avoided it ever since I got burned pretty badly for leaving it on. Being required to use it was a good thing, since I got over the inertia and also saw how much nicer the user interface had become since I last saw it.

I was doing another (this time Python based) project for the same class, and figured it would be educational to figure out how to run Python code on EMR. From a quick search on the Internet, mrjob from Yelp appeared to be the one to use on EMR, so I wrote my code using mrjob.

The code reads an input file of sentences, and builds up trigram, bigram and unigram counts of the words in the sentences. It also normalizes the text, lowercasing, replacing numbers and stopwords with placeholder tokens, and Porter stemming the remaining words. Heres the code, as you can see, its fairly straightforward:

Knowing how to exit a cloud service, and how to confirm that you have exited, is the first thing to learn about any cloud system.
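
If you have not seen mrjob before, the shape of such a job is roughly this (a stripped-down unigram counter of my own, not Sujit’s code, which also handles bigrams, trigrams, stopwords and stemming):

from mrjob.job import MRJob

class MRUnigramCount(MRJob):
    """Count normalized word occurrences across an input file of sentences."""

    def mapper(self, _, line):
        for word in line.lower().split():
            yield word, 1

    def combiner(self, word, counts):
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == "__main__":
    MRUnigramCount.run()

# Locally:        python mr_unigram_count.py sentences.txt
# On Amazon EMR:  python mr_unigram_count.py -r emr sentences.txt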

Pondering Bibliographic Coupling…

Filed under: Citation Analysis,Corporate Data,Graphs — Patrick Durusau @ 10:25 am

Pondering Bibliographic Coupling and Co-citation Analyses in the Context of Company Directorships by Tony Hirst.

From the post:

Over the last month or so, I’ve made a start reading through Mark Newman’s Networks: An Introduction, trying (though I’m not sure how successfully!) to bring an element of discipline to my otherwise osmotically acquired understanding of the techniques employed by various network analysis tools.

One distinction that made a lot of sense to me came from the domain of bibliometrics, specifically between the notions of bibliographic coupling and co-citation.

Co-citation

The idea of co-citation will be familiar to many – when one article cites a set of other articles, those other articles are “co-cited” by the first. When the same articles are co-cited by lots of other articles, we may have reason to believe that they are somehow related in a meaningful way.

(…)

Bibliographic coupling

Bibliographic coupling is actually an earlier notion, describing the extent to which two works are related by virtue of them both referencing the same other work.

Interesting musings about applying well-known views of bibliographic graphs to graphs composed of company directorships.

Tony’s suggestion of watching for patterns in directors moving together between companies is a good one but I would broaden the net a bit.

Why not track school, club, religious affiliations, etc.? All of those form networks as well.
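
Both measures fall out of a single incidence matrix with one line of numpy each. A small sketch, using a made-up director-by-company matrix in place of the usual article-by-cited-work matrix:

import numpy as np

# Rows are directors, columns are companies; 1 means "sits on that board".
# (Made-up data, standing in for the article-by-cited-work matrix.)
A = np.array([
    [1, 1, 0],   # director 0
    [1, 1, 1],   # director 1
    [0, 0, 1],   # director 2
])

# Bibliographic coupling: how many companies each pair of directors shares.
coupling = A @ A.T
# Co-citation: how many directors each pair of companies shares.
cocitation = A.T @ A

print(coupling)    # off-diagonal entries = shared boards per director pair
print(cocitation)  # off-diagonal entries = shared directors per company pair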

Rings — A Second Primer

Filed under: Mathematical Reasoning,Mathematics — Patrick Durusau @ 10:01 am

Rings — A Second Primer by Jeremy Kun.

From the post:

Last time we defined and gave some examples of rings. Recapping, a ring is a special kind of group with an additional multiplication operation that “plays nicely” with addition. The important thing to remember is that a ring is intended to remind us arithmetic with integers (though not too much: multiplication in a ring need not be commutative). We proved some basic properties, like zero being unique and negation being well-behaved. We gave a quick definition of an integral domain as a place where the only way to multiply two things to get zero is when one of the multiplicands was already zero, and of a Euclidean domain where we can perform nice algorithms like the one for computing the greatest common divisor. Finally, we saw a very important example of the ring of polynomials.

In this post we’ll take a small step upward from looking at low-level features of rings, and start considering how general rings relate to each other. The reader familiar with this blog will notice many similarities between this post and our post on group theory. Indeed, the definitions here will be “motivated” by an attempt to replicate the same kinds of structures we found helpful in group theory (subgroups, homomorphisms, quotients, and kernels). And because rings are also abelian groups, we will play fast and loose with a number of the proofs here by relying on facts we proved in our two-post series on group theory. The ideas assume a decidedly distinct flavor here (mostly in ideals), and in future posts we will see how this affects the computational aspects in more detail.

I have a feeling that Jeremy’s posts are eventually going to lead to a very good tome with a title like: Mathematical Foundations of Programming.

To be used when you want to analyze and/or invent algorithms, not simply use them.

Notable presentations at Technion TCE conference 2013: RevMiner & Boom

Filed under: Artificial Intelligence,Recommendation — Patrick Durusau @ 9:48 am

Notable presentations at Technion TCE conference 2013: RevMiner & Boom by Danny Bickson.

Danny has uncovered two papers to start your week:

http://turing.cs.washington.edu/papers/uist12-huang.pdf (RevMiner)

http://turing.cs.washington.edu/papers/kdd12-ritter.pdf (Twitter data mining)

Danny also describes Boom, for which I found this YouTube video:

See Danny’s post for more comments, etc.

White House Releases New Tools… [Bank Robber’s Defense]

Filed under: Government,Government Data,Transparency — Patrick Durusau @ 9:27 am

White House Releases New Tools For Digital Strategy Anniversary by Caitlin Fairchild.

From the post:

The White House marked the one-year anniversary of its digital government strategy Thursday with a slate of new releases, including a catalog of government APIs, a toolkit for developing government mobile apps and a new framework for ensuring the security of government mobile devices.

Those releases correspond with three main goals for the digital strategy: make more information available to the public; serve customers better; and improve the security of federal computing.

Just scanning down the API list, it is a very mixed bag.

For example, there are four hundred and ten (410) individual APIs listed, the National Library of Medicine has twenty-four (24) and the U.S. Senate has one (1).

Defenders of this release will say we should not talk about the lack of prior efforts but focus on what’s coming.

I call that the bank robber’s defense.

All prosecutors want to talk about is what a bank robber did in the past. They never want to focus on the future.

Bank robbers would love to have the “let’s talk about tomorrow” defense.

As far as I know, it isn’t allowed anywhere.

Question: Why do we allow avoidance of responsibility with the “let’s talk about tomorrow” defense for government and others?

If you review the APIs for semantic diversity, I would appreciate a pointer to your paper/post.

UK Income Tax and National Insurance

Filed under: Graphics,Marketing,Visualization — Patrick Durusau @ 8:53 am

http://www.youtube.com/watch?feature=player_embedded&v=C9ZMgG9NiUs

Now if I could just come up with the equivalent of this video for semantic diversity.

Suggestions?

I first saw this at Randy Krum’s Cool Infographics.

Stadtbilder — mapping the digital shape of cities [Dynamic Crime Maps?]

Filed under: Graphics,Visualization — Patrick Durusau @ 8:45 am

Stadtbilder — mapping the digital shape of cities by Moritz Stefaner.

From the post:

Stadtbilder (“city images”) is a new little side project of mine — an attempt to map the digital shape of cities. I am increasingly fascinated by the idea of mapping the “real world” — life and culture as opposed to just physical infrastructure — and when I learned about the really deep datasets Georgi from Uberblic had been collecting, I just had to work with the data.

The maps show an overlay of all the digitally marked “hotspots” in a city, such as restaurant, hotels, clubs, etc. collected from different service like yelp, or foursquare. What they don’t show are the streets, the railroads, the buildings. I wanted to to portray the living parts of the cities as opposed to the technical/physical infrastructure you usually see on maps.The only exception are the rivers and lakes, because I felt they help a lot in orienting on these fairly abstract maps.

Great graphics!

An idea with lots of possibilities!

The only mention of crime in Wikipedia for Atlanta, GA is:

Northwest Atlanta, marked by Atlanta’s poorest and most crime-ridden neighborhoods, has been the target of community outreach programs and economic development initiatives. (article with map)

As a matter of strict geographic description, that’s true, but not much help to a visitor.

More helpful would be a heat map of crime (broken out by type of crime?) along the local transit system (MARTA), changing color by the hour of the day.

Is there an iPhone app for that?

I don’t have an iPhone so don’t keep up with it.

If you wanted to take that a step further, offer pictures of people wanted for crimes in particular areas.
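
As a sketch of that hour-by-hour heat map (synthetic counts and made-up station names; real data would come from geocoded police reports):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
stations = ["Five Points", "Peachtree Ctr", "Midtown", "Lindbergh Ctr"]
hours = np.arange(24)

# Synthetic incident counts per station per hour of day.
incidents = rng.poisson(lam=3, size=(len(stations), len(hours)))

plt.imshow(incidents, aspect="auto", cmap="hot")
plt.yticks(range(len(stations)), stations)
plt.xticks(hours[::3], hours[::3])
plt.xlabel("Hour of day")
plt.colorbar(label="Reported incidents (synthetic)")
plt.title("Crime by transit station and hour (illustrative data)")
plt.show()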

June 1, 2013

Asset Description Metadata Schema (ADMS)

Filed under: ADMS,Linked Data — Patrick Durusau @ 7:41 pm

Asset Description Metadata Schema (ADMS)

Abstract:

The Asset Description Metadata Schema (ADMS) is a common way to describe semantic interoperability assets making it possible for everyone to search and discover them once shared through the forthcoming federation of asset repositories.

Please consult the ADMS brochure for further introduction.

The 1.0 version was released 18 April 2012.

The 1.0 version was contributed to the W3C Government Linked Data (GLD) Working Group.

Wikipedia reports that its ADMS page, http://en.wikipedia.org/wiki/Asset_Description_Metadata_Schema, is an “orphan” page.

That is, no other pages link to it.

Just in case you are looking for a weekend project.

Adventures in Advanced Symbolic Programming

Filed under: Additivity,Programming,Propagator,Symbol — Patrick Durusau @ 4:31 pm

Adventures in Advanced Symbolic Programming by Gerald Jay Sussman with Pavel Panchekha.

Description:

Concepts and techniques for the design and implementation of large software systems that can be adapted to uses not anticipated by the designer. Applications include compilers, computer-algebra systems, deductive systems, and some artificial intelligence applications. Means for decoupling goals from strategy. Mechanisms for implementing additive data-directed invocation. Working with partially-specified entities. Managing multiple viewpoints. Topics include combinators, generic operations, pattern matching, pattern-directed invocation, rule systems, backtracking, dependencies, indeterminacy, memoization, constraint propagation, and incremental refinement. Substantial weekly programming assignments are an integral part of the subject.

I was searching for updates on the Revised Report on the Propagator Model when I discovered this course and its readings page.

From the readings page:

Propagator Language and System

Suggestion: Don’t cheat yourself by focusing too narrowly. All of the reading material is worthwhile.
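
To whet your appetite for the propagator readings, here is a bare-bones Python caricature of the core idea (cells that accumulate information, propagators that wake up when their inputs change). It is my toy, not Sussman and Radul’s system, which handles partial information, contradictions and dependencies far more seriously:

class Cell:
    """Holds a (possibly still unknown) value and the propagators watching it."""
    def __init__(self):
        self.value = None
        self.watchers = []

    def add_content(self, value):
        if value is None or value == self.value:
            return                      # nothing new to learn
        if self.value is not None:
            raise ValueError("contradiction!")
        self.value = value
        for propagate in self.watchers:  # alert everyone watching this cell
            propagate()

def adder(a, b, total):
    """Bidirectional propagator maintaining a + b = total."""
    def propagate():
        if a.value is not None and b.value is not None:
            total.add_content(a.value + b.value)
        elif total.value is not None and a.value is not None:
            b.add_content(total.value - a.value)
        elif total.value is not None and b.value is not None:
            a.add_content(total.value - b.value)
    for cell in (a, b, total):
        cell.watchers.append(propagate)

x, y, z = Cell(), Cell(), Cell()
adder(x, y, z)
z.add_content(10)
x.add_content(3)
print(y.value)  # 7 -- derived by running the adder "backwards"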

4Store

Filed under: 4store,RDF,Semantic Web — Patrick Durusau @ 4:00 pm

4Store

From the about page:

4store is a database storage and query engine that holds RDF data. It has been used by Garlik as their primary RDF platform for three years, and has proved itself to be robust and secure.

4store’s main strengths are its performance, scalability and stability. It does not provide many features over and above RDF storage and SPARQL queries, but if your are looking for a scalable, secure, fast and efficient RDF store, then 4store should be on your shortlist.

This was mentioned in a report by Bryan Thompson so I wanted to capture a link to the original site.

The latest tarball is dated 10 July 2012.

1000 Genomes…

Filed under: 1000 Genomes,Bioinformatics,Genomics — Patrick Durusau @ 3:44 pm

1000 Genomes: A Deep Catalog of Human Genetic Variation

From the overview:

Recent improvements in sequencing technology (“next-gen” sequencing platforms) have sharply reduced the cost of sequencing. The 1000 Genomes Project is the first project to sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation.

As with other major human genome reference projects, data from the 1000 Genomes Project will be made available quickly to the worldwide scientific community through freely accessible public databases. (See Data use statement.)

The goal of the 1000 Genomes Project is to find most genetic variants that have frequencies of at least 1% in the populations studied. This goal can be attained by sequencing many individuals lightly. To sequence a person’s genome, many copies of the DNA are broken into short pieces and each piece is sequenced. The many copies of DNA mean that the DNA pieces are more-or-less randomly distributed across the genome. The pieces are then aligned to the reference sequence and joined together. To find the complete genomic sequence of one person with current sequencing platforms requires sequencing that person’s DNA the equivalent of about 28 times (called 28X). If the amount of sequence done is only an average of once across the genome (1X), then much of the sequence will be missed, because some genomic locations will be covered by several pieces while others will have none. The deeper the sequencing coverage, the more of the genome will be covered at least once. Also, people are diploid; the deeper the sequencing coverage, the more likely that both chromosomes at a location will be included. In addition, deeper coverage is particularly useful for detecting structural variants, and allows sequencing errors to be corrected.

Sequencing is still too expensive to deeply sequence the many samples being studied for this project. However, any particular region of the genome generally contains a limited number of haplotypes. Data can be combined across many samples to allow efficient detection of most of the variants in a region. The Project currently plans to sequence each sample to about 4X coverage; at this depth sequencing cannot provide the complete genotype of each sample, but should allow the detection of most variants with frequencies as low as 1%. Combining the data from 2500 samples should allow highly accurate estimation (imputation) of the variants and genotypes for each sample that were not seen directly by the light sequencing.

If you are looking for a large data set other than CiteSeer and DBpedia, you should consider something from the 1000 Genomes project.

Lots of significant data with more on the way.
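
The overview’s point about 1X versus 4X versus 28X coverage is easy to make concrete. Under an idealized Poisson (Lander-Waterman) model, the expected fraction of bases covered at least once at average coverage c is about 1 - e^(-c):

import math

def fraction_covered(coverage):
    """Expected fraction of the genome hit by at least one read,
    under an idealized Poisson (Lander-Waterman) model."""
    return 1.0 - math.exp(-coverage)

for c in (1, 4, 28):
    print(f"{c:>2}X coverage -> about {fraction_covered(c):.4%} of bases covered")

Which is why 1X misses roughly a third of the genome while 28X misses essentially nothing (sequencing errors and bias aside).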
