Archive for the ‘Uncategorized’ Category

Complete Guide to Topic Modeling (Recommender System for Email Dumps?)

Friday, January 12th, 2018

Complete Guide to Topic Modeling with scikit-learn and gensim by George-Bogdan Ivanov.

From the post:

Why is Topic Modeling useful?

There are several scenarios when topic modeling can prove useful. Here are some of them:

  • Text classification – Topic modeling can improve classification by grouping similar words together in topics rather than using each word as a feature
  • Recommender Systems – Using a similarity measure we can build recommender systems. If our system would recommend articles for readers, it will recommend articles with a topic structure similar to the articles the user has already read.
  • Uncovering Themes in Texts – Useful for detecting trends in online publications for example

Would a recommender system be useful for reading email dumps? 😉

Within or across candidates for Congress?
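The recommender idea quoted above (suggest documents whose topic structure resembles what the reader has already read) can be sketched in a few lines. This is a minimal illustration, not code from the guide: it assumes each email has already been reduced to a vector of topic weights by a topic model, and the email ids, the three topics, and the weights are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length topic-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(read_ids, topic_vectors, top_n=2):
    """Rank unread documents by similarity to the average of the read ones."""
    n_topics = len(next(iter(topic_vectors.values())))
    # The reader's profile is the mean topic mixture of everything read so far.
    profile = [
        sum(topic_vectors[doc_id][k] for doc_id in read_ids) / len(read_ids)
        for k in range(n_topics)
    ]
    candidates = [doc_id for doc_id in topic_vectors if doc_id not in read_ids]
    ranked = sorted(
        candidates,
        key=lambda doc_id: cosine_similarity(profile, topic_vectors[doc_id]),
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical topic mixtures (budget, travel, fundraising) for five emails.
EMAIL_TOPICS = {
    "email-001": [0.80, 0.10, 0.10],
    "email-002": [0.70, 0.20, 0.10],
    "email-003": [0.10, 0.80, 0.10],
    "email-004": [0.85, 0.05, 0.10],
    "email-005": [0.05, 0.15, 0.80],
}

if __name__ == "__main__":
    print(recommend(["email-001"], EMAIL_TOPICS))
```

Swapping in real topic vectors from a fitted model is the only change needed to try this on an actual dump; the similarity measure itself is a design choice and others would work as well.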

Unknown Buyers + Unknown Sellers ~= Closed Source Software

Friday, June 2nd, 2017

TurkuSec Community reports another collaborative effort to buy into the Shadow Brokers malware-of-the-month club:

“What Could Go Wrong?” is a valid question.

On the other hand, you are already spending $billions on insecure software every year.

Most of which is closed-source, meaning it may contain CIA/NSA backdoors.

A few hires in the right places and unbeknownst to the vendor, they would be distributing CIA/NSA malware.

If you credit denials of such activities by the CIA/NSA or any other government spy agency, you should stop using computers. You are a security risk to your employer.

A Shadow Brokers subscription, where 2,500 people risk $100 each for each release, on the other hand, is far safer than commercial software. If the first release proves bogus, don’t buy a second one.

Contrast that with insecure closed source software for an OS or database that may contain CIA/NSA/etc. backdoors. You don’t get to avoid the second purchase. (You bought the maintenance package too. Am I right?)

I can’t and won’t counsel anyone to risk more than $100, but shared risk is the fundamental principle of insurance. Losses can and will happen. That’s why we distribute the risk.

That link again:

PS: Shadow Brokers: Even a list of the names with brief descriptions might help attract people who want to share the risk of subscribing. The “big” corporations are likely too arrogant to think they need the release.

#Resist vs. #EffectiveResist

Monday, February 27th, 2017

DAPL Could Be Operational In Less Than 2 Weeks

From the post:

“Dakota Access estimates and targets that the pipeline will be complete and ready to flow oil anywhere between the week of March 6, 2017, and April 1, 2017,” company attorney William Scherman said in the documents filed in Washington, D.C., on Tuesday.

Opponents to the Dakota Access Pipeline (DAPL) have two choices, #Resist or #EffectiveResist.

The new moon for February, 2017, was February 26, 2017 (yesterday). (Bookmark that link to discover other new moons in the future.)

Given the reduced visibility on nights with a new moon, you can take up rock sculpting with a thermal lance.

This is a very portable rig, but requires the same eye protection (welding goggles, no substitutes) and protective clothing as other welding activities.

Notice in the next video, which demonstrates professional grade equipment, the heavy protective headgear and clothing. Thermal lances are very dangerous and safety is your first concern.

If you create a bar-b-que pit from large pipe, follow Zippy the Razor’s advice, “Down the block, Not across the street” to create long cuts the length of your pipe.

Will DAPL be a lesson to investors on the risk of no return from oil pipeline investments? Pending court litigation may play a role in that lesson.

Historic American Newspapers (Bulk OCR Data Find!)

Sunday, January 1st, 2017

Historic American Newspapers

From the webpage:

Search America’s historic newspaper pages from 1789-1924 or use the U.S. Newspaper Directory to find information about American newspapers published between 1690-present.

A total of 2,134 newspapers, digitized (images) and searchable. Some 11,520,159 pages for searching and review.

Quite a treasure trove for genealogy types, primary/secondary research papers, people trying to escape the smoothing influence of history books over historical events, and others.

Did I mention the site has an API?

Or that it offers access to all of its OCR data in bulk?

It’s not “big data” in the sense of the astronomy community but creating sub-sets for local communities of “their papers” would have a certain cachet.


How To Brick A School Bus, Data Science Helps Park It (Part 1)

Tuesday, December 13th, 2016

Apologies for being a day late! I was working on how the New York Times acted as a bullhorn for those election-interfering Russian hackers.

We left off in Data Science and Protests During the Age of Trump [How To Brick A School Bus…] with:

  • How best to represent these no free speech and/or no free assembly zones on a map?
  • What data sets do you need to make protesters effective under these restrictions?
  • What questions would you ask of those data sets?
  • How to decide between viral/spontaneous action versus publicly known but lawful conduct, up until the point it becomes unlawful?

I started this series of posts because the Women’s March on Washington wasn’t able to obtain a protest permit from the National Park Service due to a preemptive reservation by the Presidential Inauguration Committee.

Since then, the Women’s March on Washington has secured a protest permit (sic) from the Metropolitan Police Department.

If you are interested in protests organized for the convenience of government:

“People from across the nation will gather at the intersection of Independence Avenue and Third Street SW, near the U.S. Capitol, at 10:00am” on Jan. 21, march organizers said in a statement on Friday.

Each to their own.

Bricking A School Bus

We are all familiar with the typical school bus:


By Die4kids (Own work) [GFDL or CC BY-SA 3.0], via Wikimedia Commons

The saying, “no one size fits all,” applies to the load capacity of school buses. For example, the North Carolina School Bus Safety Web posted this spreadsheet detailing the empty (column I) and maximum weight (column R) of a variety of school bus sizes. For best results, get the GVWR (Gross Vehicle Weight Rating, maximum load) for your bus and then weigh it on reliable scales.

Once you determine the maximum weight capacity of your bus, divide that weight by 4,000 pounds, the weight of one cubic yard of concrete. The result is the amount of concrete, in cubic yards, that you can have poured into your bus as part of the bricking process.

I use the phrase “your bus” deliberately because pouring concrete into a school bus that doesn’t belong to you would be destruction of private property and thus a crime. Don’t commit crimes. Use your own bus.

Once the concrete has hardened (for stability), drive to a suitable location. It’s a portable barricade, at least for a while.

At a suitable location, puncture the tires on one side and tip the bus over. Remove/burn the tires.

Consulting line 37 of the spreadsheet, with that bus, you have a barricade of almost 30,000 pounds, with no wheels.


I’m still working on the data science aspects of where to park. More on that in How To Brick A School Bus, Data Science Helps Park It (Part 2), which I will post tomorrow.

“Just the texts, Ma’am, just the texts” – Colin Powell Emails Sans Attachments

Monday, October 3rd, 2016

As I reported in Bulk Access to the Colin Powell Emails – Update, I was looking for a host for the complete Colin Powell emails at 2.5 GB, but I failed on that score.

I can’t say if that result is lack of interest in making the full emails easily available or if I didn’t ask the right people. Please circulate my request when you have time.

In the meantime, I have been jumping from one “easy” solution to another, most of which involved parsing the .eml files.

But my requirement is to separate the attachments from the emails, quickly and easily, not to parse the .eml files in preparation for further processing.

How does a 22 character, command line sed expression sound?

Do you know of an “easier” solution?

sed -i '/base64/,$d' *

My reasoning: the first attachment (in the event of multiple attachments) will include the string “base64”, so I pass a range expression that starts at the first line matching “base64” and ends at the last line of the message (“$”), delete that range (“d”), and write the files in place (“-i”).
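For anyone without sed at hand, the same truncate-at-first-match idea can be sketched in Python. This is only an illustration of the range delete described above, not a replacement for a real MIME parser; the function names and the "*.eml" glob are assumptions made for the example.

```python
from pathlib import Path


def strip_attachments(raw: bytes, marker: bytes = b"base64") -> bytes:
    """Keep the lines before the first one containing `marker`; drop the rest.

    Mirrors the sed range '/base64/,$d': if the marker never appears,
    the message is returned unchanged.
    """
    kept = []
    for line in raw.splitlines(keepends=True):
        if marker in line:
            break
        kept.append(line)
    return b"".join(kept)


def strip_directory(directory, pattern="*.eml"):
    """Rewrite every matching file in place, like sed's -i (pattern is an assumption)."""
    for path in Path(directory).glob(pattern):
        path.write_bytes(strip_attachments(path.read_bytes()))


if __name__ == "__main__":
    sample = (
        b"Subject: hello\r\n\r\nSee attached.\r\n"
        b"Content-Transfer-Encoding: base64\r\n\r\nQUJD\r\n"
    )
    print(strip_attachments(sample).decode())
```

Like the sed expression, this is crude: a message whose body text is itself base64 encoded, or merely mentions the word, will be cut at that point.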

There are far more sophisticated solutions to this problem but as crude as this may be, I have reduced the 2.5 GB archive file that includes all the emails and their attachments down to 63 megabytes.

Attachments are important too but my first steps were to make these and similar files more accessible.

Obtaining > 29K files through the drinking straw at DCLeaks, or waiting until I find a host for a consolidated 2.5 GB file, doesn’t make these files more accessible.

A 63 MB download of the Colin Powell Emails With No Attachments may.

Please feel free to mirror these files.

PS: One oddity I noticed in testing the download. With Chrome, the file size inflates to 294MB. With Mozilla, the file size is 65MB. ? Both unpack properly. Suggestions?

PPS: More sophisticated processing of the raw emails and other post-processing to follow.

Type-driven Development … [Further Reading]

Saturday, October 1st, 2016

The Further Reading slide from Edwin Brady’s presentation Type-driven Development of Communicating Systems in Idris (Lambda World, 2016) was tweeted as an image, eliminating the advantages of hyperlinks.

I have reproduced that slide with the links as follows:

Further Reading

On total functional programming

On interactive programming with dependent types

On types for communicating systems:

On Wadler’s paper, you may enjoy the video of his presentation, Propositions as Sessions or his slides (2016), Propositions as Sessions, Philip Wadler, University of Edinburgh, Betty Summer School, Limassol, Monday 27 June 2016.

Graph Computing with Apache TinkerPop

Thursday, September 29th, 2016

From the description:

Apache TinkerPop serves as an Apache governed, vendor-agnostic, open source initiative providing a standard interface and query language for both OLTP- and OLAP-based graph systems. This presentation will outline the means by which vendors implement TinkerPop and then, in turn, how the Gremlin graph traversal language is able to process the vendor’s underlying graph structure. The material will be presented from the perspective of the DSEGraph team’s use of Apache TinkerPop in enabling graph computing features for DataStax Enterprise customers.


Marko is brutally honest.

He warns the early part of his presentation is stream of consciousness and that is the truth!


That takes you to time mark 11:37 and the description of Gremlin as a language begins.

Marko slows, momentarily, but rapidly picks up speed.

Watch the video, then grab the slides and mark what has captured your interest. Use the slides as your basis for exploring Gremlin and Apache TinkerPop documentation.


Is Your IP Address Leaking? – Word for the Day: Trust – Synonym for pwned.

Wednesday, July 20th, 2016

How to See If Your VPN Is Leaking Your IP Address (and How to Stop It) by Alan Henry.

From the post:

To see if your VPN is affected:

  • Visit a site like What Is My IP Address and jot down your actual ISP-provided IP address.
  • Log in to your VPN, choose an exit server in another country (or use whichever exit server you prefer) and verify you’re connected.
  • Go back to What Is My IP Address and check your IP address again. You should see a new address, one that corresponds with your VPN and the country you selected.
  • Visit Roseler’s WebRTC test page and note the IP address displayed on the page.
  • If both tools show your VPN’s IP address, then you’re in the clear. However, if What Is My IP Address shows your VPN and the WebRTC test shows your normal IP address, then your browser is leaking your ISP-provided address to the world.
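The decision at the end of that checklist can be written down as a small function. This is a hedged sketch of the comparison only: it assumes you copy the three addresses in by hand (nothing here queries either site), the function name and verdict strings are made up for the example, and the addresses used are reserved documentation addresses.

```python
import ipaddress


def vpn_leak_verdict(isp_ip, ip_site_result, webrtc_result):
    """Compare the address noted before connecting with what each test shows.

    isp_ip:         ISP-provided address noted before logging in to the VPN
    ip_site_result: address shown by the IP lookup site while on the VPN
    webrtc_result:  address shown by the WebRTC test page while on the VPN
    """
    isp = ipaddress.ip_address(isp_ip)
    seen_by_site = ipaddress.ip_address(ip_site_result)
    seen_by_webrtc = ipaddress.ip_address(webrtc_result)

    if seen_by_site == isp:
        # The lookup site still sees the ISP address: the VPN is not in effect.
        return "vpn not active"
    if seen_by_webrtc == isp:
        # The lookup site sees the VPN, but WebRTC exposes the ISP address.
        return "webrtc leak"
    return "no leak detected"


if __name__ == "__main__":
    print(vpn_leak_verdict("203.0.113.7", "198.51.100.20", "203.0.113.7"))
```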

    Attempting to conceal your IP address and at the same time leaking it (one assumes unknowingly), can lead to a false sense of security.

    Follow the steps Alan outlines to test your setup.

    BTW, Alan’s post includes suggestions for how to fix the leak.

    If you blindly trust concealment measures and software, you may as well activate links in emails from your local bank.

    Word for the Day: Trust – Synonym for pwned.

    Verify your concealment on a regular basis.

    The Absence of Proof Against China on OPM Hacks

    Saturday, June 27th, 2015

    The Obama Administration has failed to release any evidence connecting China to the OPM hacks.

    Now we know why: Hacked OPM and Background Check Contractors Lacked Logs, DHS Says.

    From the post:

    Tracking everyday network traffic requires an investment and some managers decide the expense outweighs the risk of a breach going undetected, security experts say.

    In this case, taking chances has delayed a probe into the exposure of secrets on potentially 18 million national security personnel.

    Hopefully congressional hearings will expand “some managers” into a list of identified individuals.

    That is a level of incompetence that verges on the criminal.

    Not having accountability for government employees has not led to a secure IT infrastructure. Time to try something new. Like holding all employees accountable for their incompetence.

    Debating Public Policy, On The Basis of Fictions

    Sunday, May 3rd, 2015

    Striking a Balance—Whistleblowing, Leaks, and Security Secrets by Cody Poplin.

    From the post:

    Last weekend, the New York Times published an article outlining the strength of congressional support for the CIA targeted killing program. In the story, the Times also purported to reveal the identities of three covert CIA operatives who now hold senior leadership roles within the Agency.

    As you might expect, the decision generated a great deal of controversy, which Lawfare covered here and here. Later in the week, Jack Goldsmith interviewed Executive Editor of the New York Times Dean Baquet to discuss the decision. That conversation also prompted responses from Ben, Mark Mazzetti (one of the authors of the piece), and an anonymous intelligence community reader.

    Following Times’ story, the Johns Hopkins University Center for Advanced Governmental Studies, along with the James Madison Project and our friends at Just Security, hosted a timely conference on Secrecy, Openness and National Security: Lessons and Issues for the Next Administration. In a panel entitled Whistleblowing and America’s Secrets: Ensuring a Viable Balance, Bob Litt, General Counsel for the Office of the Director of National Intelligence, blasted the Times, saying that the paper had “disgraced itself.”

    However, the panel—which with permission from the Center for Advanced Governmental Studies, we now present in full—covered much more than the latest leak published in the Times. In a conversation moderated by Mark Zaid, the Executive Director of the James Madison Project, Litt, along with Ken Dilanian, Dr. Gabriel Schoenfeld, and Steve Vladeck, tackled a vast array of important legal and policy questions surrounding classified leak prosecutions, the responsibilities of the press, whistleblower protections, and the future of the Espionage Act.

    It’s a jam-packed discussion full of candid exchanges—some testy, most cordial—that greatly raises the dialogue on the recent history of leaks, prosecutions, and future lessons for the next Administration.

    Spirited debate but on the basis of known fictions.

    For example, Bob Litt, General Counsel for the Office of the Director of National Intelligence, poses a hypothetical question that compares an alleged suppression of information about the Bay of Pigs invasion to whether a news organization would be justified in leaking the details of plans to assassinate Osama bin Laden.

    The premise of the hypothetical is flawed. It is based on an alleged statement by President Kennedy wishing the New York Times had published the details in their possession. One assumes so that public reaction would have prevented the ensuing disaster.

    The story of President Kennedy suppressing a story in the New York Times about the Bay of Pigs is a myth.

    Busting the NYTimes suppression myth, 50 years on reports:

    Indeed, the Times’ purported spiking has been called the “symbolic journalistic event of the 1960s.”

    Only the Times didn’t censor itself.

    It didn’t kill, spike, or otherwise emasculate the news report published 50 years ago tomorrow that lies at the heart of this media myth.

    That article was written by a veteran Times correspondent named Tad Szulc, who reported that 5,000 to 6,000 Cuban exiles had received military training for a mission to topple Fidel Castro’s regime; the actual number of invaders was about 1,400.

    The story, “Anti-Castro Units Trained At Florida Bases,” ran on April 7, 1961, above the fold on the front page of the New York Times.

    The invasion of the Bay of Pigs happened ten days later, April 17, 1961.

    Hardly sounds like suppression of the story does it?

    That is just one fiction that formed the basis for part of the discussion in this podcast.

    Another fiction is that leaked national security information, some of Edward Snowden’s materials for example, was damaging to national security. Except that those who claim to know can’t say what information or how it was damaging.

    Without answers to what information and how it was damaging to national security, their claims of “damage to national security” should go straight into the myth bin. The unbroken record of leaks shows illegal activity, incompetence, waste and avoidance of responsibility. None of those are in the national interest.

    If the media does want to act in the “public interest,” then it should stop repeating unsubstantiated claims of damage to the “national interest” by the security community. Repeating falsehoods does not make them useful for debates of public policy. When advanced, such claims should be challenged and then, absent sufficient details for the public to reach its own conclusion about the claim, excluded from further discussion.

    Another myth in this discussion is the assumption that the media has an in loco parentis role vis-a-vis the public. That media representatives should act on the public’s behalf in determining what is or is not in the “public interest.” Complete surprise to me and I have read the Constitution more than once or twice.

    I don’t remember seeing the media called out in the Constitution as guardians for a public too stupid to decide matters of public policy for itself.

    That is the central flaw with national security laws and the rights of leakers and leakees. The government of the United States, for those unfamiliar with the Constitution, is answerable under the Constitution to the citizens of the United States. Not any branch of government or its agencies but to the citizens.

    There are no exceptions to the United States government being accountable to its citizens. Not one. To hold government accountable, its citizens need to know what government has been doing, to whom and why. The government has labored long and hard, especially its security services, to avoid accountability to its citizens. Starting shortly after its inception.

    There should be no penalties for leakers or leakees. Leaks will cause hardships, such as careers ending due to dishonesty, incompetence, waste and covering for others engaged in the same. If you don’t like that, move to a country where the government isn’t answerable to its citizens. May I suggest Qatar?

    Maybe Friday (17th April) or Monday (20th April) DARPA – Dark Net

    Wednesday, April 15th, 2015

    Memex In Action: Watch DARPA Artificial Intelligence Search For Crime On The ‘Dark Web’ by Thomas Fox-Brewster.

    Is DARPA’s Memex search engine a Google-killer? by Mark Stockley.

    A couple of “while you wait” pieces to read while you expect part of the DARPA Memex project to appear on its Open Catalog page, either this coming Friday (17th of April) or Monday (20th of April).

    Fox-Brewster has video of a part of the system that:

    It is trying to overcome one of the main barriers to modern search: crawlers can’t click or scroll like humans do and so often don’t collect “dynamic” content that appears upon an action by a user.

    If you think searching is difficult now, with an estimated 5% of the web being indexed, just imagine bumping that up 10X or more.

    Entirely manual indexing is already impossible and you have experienced the shortcomings of page ranking.

    Perhaps the components of Memex will enable us to step towards a fusion of human and computer capabilities to create curated information resources.

    Imagine an electronic The Art of Computer Programming with several human experts per chapter who, assisted by deep searching, update the references and the text on an ongoing basis. Readers would not have to weed through all the re-inventions of particular algorithms across numerous computer and math journals.

    Or perhaps a more automated search of news reports so the earliest/most complete report is returned with the notation: “There are NNNNNN other, later and less complete versions of this story.” It isn’t that every major paper adds value, more often just content.

    BTW, the focus on the capabilities of the search engine, as opposed to the analysis of those results, is most welcome.

    See my post on its post-search capabilities: DARPA Is Developing a Search Engine for the Dark Web.

    Looking forward to Friday or Monday!

    Google DeepMind Resources

    Monday, March 30th, 2015

    Google DeepMind Resources

    A collection of all the Google DeepMind publications to date.

    Twenty-two (22) papers so far!

    A nice way to start your reading week!


    How To Build Linked Data APIs…

    Wednesday, October 15th, 2014

    This is the second high signal-to-noise presentation I have seen this week! I am sure that streak won’t last but I will enjoy it as long as it does.

    Resources for after you see the presentation: Hydra: Hypermedia-Driven Web APIs, JSON for Linking Data, and, JSON-LD 1.0.

    Near the end of the presentation, Markus quotes Phil Archer, W3C Data Activity Lead:

    Archer on Semantic Web

    Which is an odd statement considering that JSON-LD 1.0 Section 7 Data Model, reads in part:

    JSON-LD is a serialization format for Linked Data based on JSON. It is therefore important to distinguish between the syntax, which is defined by JSON in [RFC4627], and the data model which is an extension of the RDF data model [RDF11-CONCEPTS]. The precise details of how JSON-LD relates to the RDF data model are given in section 9. Relationship to RDF.

    And section 9. Relationship to RDF reads in part:

    JSON-LD is a concrete RDF syntax as described in [RDF11-CONCEPTS]. Hence, a JSON-LD document is both an RDF document and a JSON document and correspondingly represents an instance of an RDF data model. However, JSON-LD also extends the RDF data model to optionally allow JSON-LD to serialize Generalized RDF Datasets. The JSON-LD extensions to the RDF data model are:…

    Is JSON-LD “…a concrete RDF syntax…” where you can ignore RDF?

    Not that I was ever a fan of RDF but standards should be fish or fowl and not attempt to be something in between.
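The “ignore RDF” reading is easy to demonstrate, since any JSON parser will accept a JSON-LD document. The sketch below is an illustration under stated assumptions, not a conforming JSON-LD processor: the document, its example.org identifiers, and both function names are made up for the example, and the term expansion handles only a simple inline @context at the top level.

```python
import json

# Hypothetical JSON-LD document; the identifiers are illustrative only.
DOCUMENT = """
{
  "@context": {
    "name": "http://schema.org/name",
    "homepage": {"@id": "http://schema.org/url", "@type": "@id"}
  },
  "@id": "http://example.org/people/alice",
  "name": "Alice",
  "homepage": "http://example.org/alice/"
}
"""


def read_as_plain_json(text):
    """The 'ignore RDF' view: drop every @-keyword and use the rest as-is."""
    data = json.loads(text)
    return {key: value for key, value in data.items() if not key.startswith("@")}


def expand_terms(text):
    """The RDF-leaning view: naively map top-level terms to the IRIs in @context."""
    data = json.loads(text)
    context = data.get("@context", {})
    expanded = {}
    for key, value in data.items():
        if key.startswith("@"):
            continue
        definition = context.get(key, key)
        iri = definition["@id"] if isinstance(definition, dict) else definition
        expanded[iri] = value
    return expanded


if __name__ == "__main__":
    print(read_as_plain_json(DOCUMENT))
    print(expand_terms(DOCUMENT))
```

Both views come from the same bytes, which is the fish-or-fowl tension: the first consumer never learns that "name" was meant to be an IRI.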

    Speakers, Clojure/conj 2014 Washington, D.C. Nov 20-22

    Monday, September 8th, 2014

    Speakers, Clojure/conj 2014 Washington, D.C. Nov 20-22

    Hyperlinks for authors point to Twitter profile pages, title of paper follows:

    Jeanine Adkisson Variants are Not Unions

    Bozhidar Batsov The evolution of the Emacs tooling for Clojure

    Lucas Cavalcanti Exploring Four Hidden Superpowers of Datomic

    Colin Fleming Cursive: a different type of IDE

    Julian Gamble Applying the paradigms of core.async in ClojureScript

    Brian Goetz Keynote

    Paul deGrandis Unlocking Data-Driven Systems

    Nathan Herzing Helping voters with Pedestal, Datomic, Om and core.async

    Rich Hickey Transducers

    Ashton Kemerling Generative Integration Tests

    Michał Marczyk Persistent Data Structures for Special Occasions

    Steve Miner Generating Generators

    Zach Oakes Making Games at Runtime with Clojure

    Anna Pawlicka Om nom nom nom

    David Pick Building a Data Pipeline with Clojure and Kafka

    Ghadi Shayban JVM Creature Comforts

    Chris Shea Helping voters with Pedestal, Datomic, Om and core.async

    Zach Tellman Always Be Composing

    Glenn Vanderburg Cló: The Algorithms of TeX in Clojure

    Edward Wible Exploring Four Hidden Superpowers of Datomic

    Steven Yi Developing Music Systems on the JVM with Pink and Score

    Abstracts for the papers appear here.

    Obviously a great conference to attend but at a minimum, you have a great list of Twitter accounts to follow for cutting edge Clojure news!

    I first saw this in a tweet by Alex Miller.

    (Functional) Reactive Programming (FRP) [tutorial]

    Tuesday, July 1st, 2014

    The introduction to Reactive Programming you’ve been missing by Andre Staltz.

    From the post:

    So you’re curious in learning this new thing called (Functional) Reactive Programming (FRP).

    Learning it is hard, even harder by the lack of good material. When I started, I tried looking for tutorials. I found only a handful of practical guides, but they just scratched the surface and never tackled the challenge of building the whole architecture around it. Library documentations often don’t help when you’re trying to understand some function. I mean, honestly, look at this:

    Rx.Observable.prototype.flatMapLatest(selector, [thisArg])

    Projects each element of an observable sequence into a new sequence of observable sequences by incorporating the element’s index and then transforms an observable sequence of observable sequences into an observable sequence producing values only from the most recent observable sequence.

    Holy cow.

    I’ve read two books, one just painted the big picture, while the other dived into how to use the FRP library. I ended up learning Reactive Programming the hard way: figuring it out while building with it. At my work in Futurice I got to use it in a real project, and had the support of some colleagues when I ran into troubles.

    The hardest part of the learning journey is thinking in FRP. It’s a lot about letting go of old imperative and stateful habits of typical programming, and forcing your brain to work in a different paradigm. I haven’t found any guide on the internet in this aspect, and I think the world deserves a practical tutorial on how to think in FRP, so that you can get started. Library documentation can light your way after that. I hope this helps you.
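The flatMapLatest description quoted above is easier to follow with a concrete model. The sketch below is not the Rx implementation: it is a deterministic toy that represents a stream as a list of (time, value) events, leaves out the element index and completion handling, and uses a hypothetical `ticks` selector. It shows only the core behavior, that values come from the most recent inner stream.

```python
def flat_map_latest(outer, selector):
    """Toy model of 'switch to the latest inner stream'.

    outer:    list of (time, value) events, sorted by time
    selector: function (value, start_time) -> list of (time, value) events
              making up the inner stream started by that outer value
    Returns the events emitted while each inner stream was the most recent one.
    """
    output = []
    for index, (start, value) in enumerate(outer):
        # An inner stream is abandoned as soon as the next outer event arrives.
        if index + 1 < len(outer):
            next_start = outer[index + 1][0]
        else:
            next_start = float("inf")
        for time, inner_value in selector(value, start):
            if start <= time < next_start:
                output.append((time, inner_value))
    return output


def ticks(value, start_time):
    """Hypothetical inner stream: three events, 10 time units apart."""
    return [(start_time + 10 * i, f"{value}{i}") for i in range(1, 4)]


if __name__ == "__main__":
    # "a" starts at t=0 and "b" at t=25, so a3 (due at t=30) is never emitted.
    print(flat_map_latest([(0, "a"), (25, "b")], ticks))
```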

    Andre is moving in the right direction when he announces:

    FRP is programming with asynchronous data streams.

    Data streams. I have been hearing that a lot lately. 😉

    The view that data is static, file based, etc., was an artifact of our storage and processing technology. Not that data “streams” is a truer view of data but it is a different one.

    The semantic/subject identity issues associated with data don’t change whether you have a static or stream view of data.

    Although, with data streams, the processing requirements for subject identity become different. For example, with static data a change (read merger) can propagate throughout a topic map.

    With data streams, there may be no retrospective application of a new merging rule; it may only impact data streams going forward. Your view of the topic map becomes a time-based snapshot of the current state of a merged data stream.
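A small sketch may make the forward-only behavior concrete. Everything here is hypothetical (the `StreamMerger` class, the `by_email` rule, the record layout); it is one possible reading of "no retrospective application," not a description of any topic map engine.

```python
class StreamMerger:
    """Merge records from a stream by subject key, using whatever rules exist
    at the moment each record arrives. Earlier records are never revisited."""

    def __init__(self):
        self.rules = []     # functions: record -> subject key, or None
        self.snapshot = {}  # subject key -> records merged under it so far

    def add_rule(self, rule):
        self.rules.append(rule)

    def process(self, record):
        for rule in self.rules:
            key = rule(record)
            if key is not None:
                break
        else:
            # No rule identified a subject: the record stands alone.
            key = ("unmerged", record["id"])
        self.snapshot.setdefault(key, []).append(record)
        return key


def by_email(record):
    """Hypothetical merging rule: same email address means same subject."""
    email = record.get("email")
    return ("email", email.lower()) if email else None


if __name__ == "__main__":
    merger = StreamMerger()
    merger.process({"id": 1, "email": "a@example.org"})
    merger.process({"id": 2, "email": "a@example.org"})
    merger.add_rule(by_email)  # the rule arrives mid-stream
    merger.process({"id": 3, "email": "a@example.org"})
    merger.process({"id": 4, "email": "A@example.org"})
    for key, records in merger.snapshot.items():
        print(key, [r["id"] for r in records])
```

Records 1 and 2 remain separate even though the later rule would have merged them, which is the time-based snapshot effect described above.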

    If you are looking for ways to explore such issues, FRP and this tutorial are a good place to start.

    How to uncover a scandal from your couch

    Wednesday, February 26th, 2014

    How to uncover a scandal from your couch by Brad Racino and Joe Yerardi.

    From the post:

    News broke in San Diego last week about a mysterious foreign national bent on influencing San Diego politics by illegally funneling money to political campaigns through a retired San Diego police detective and an undisclosed “straw donor.” Now, the politicians on the receiving end of the tainted funds are scrambling to distance themselves from the scandal.

    The main piece of information that started it all was an unsealed FBI complaint.

    Fifteen pages in length, the report named only the detective, Ernesto Encinas, and a self-described “campaign guru” in DC, Ravneet Singh, as conspirators. The names of everyone else involved were not disclosed. Instead, politicians were “candidates,” the moneymen were called “the Straw Donor” and “the Foreign National,” and informants were labeled “CIs.”

    It didn’t take long for reporters to piece together clues — mainly by combing through publicly-accessible information from the San Diego City Clerk’s website — to uncover who was involved.

    A great post that walks you through the process of taking a few facts and hints from an FBI complaint and fleshing them out with details.

    This could make an exciting exercise for a library class.

    I first saw this at: Comment and a link to a SearchResearch-like story by Daniel M. Russell.

    A Tool for Wicked Problems:…

    Tuesday, February 18th, 2014

    A Tool for Wicked Problems: Dialogue Mapping™ FAQs

    From the webpage:

    What is Dialogue Mapping™?

    Dialogue Mapping™ is a radically inclusive facilitation process that creates a diagram or ‘map’ that captures and connects participants’ comments as a meeting conversation unfolds. It is especially effective with highly complex or “Wicked” problems that are wrought with both social and technical complexity, as well as a sometimes maddening inability to move forward in a meaningful and cost effective way.

    Dialogue Mapping™ creates forward progress in situations that have been stuck; it clears the way for robust decisions that last. It is effective because it works with the non-linear way humans really think, communicate, and make decisions.

    I don’t disagree that humans really think in a non-linear way but some of that non-linear thinking is driven by self-interest, competition, and other motives that you are unlikely to capture with dialogue mapping.

    Still, to keep you from hunting for software, the CompendiumInstitute was at the Open University until early 2013.

    CompendiumNG has taken over maintenance of the project.

    All three sites have videos and other materials that you may find of interest.

    If you want to go beyond dialogue mapping per se, consider augmenting a dialogue map, post-dialogue with additional information. Just as you would add information to any other subject identification.

    Or in real time if you really want a challenge.

    A live dialogue map of one of the candidate “debates” could be very amusing.

    I put “debates” in quotes because no moderator ever penalizes the participants for failing to answer questions. The faithful hear what they want to hear and strain at the mote in the opposition’s eye.

    I first saw this in a tweet by Neil Saunders.

    Rule-based deduplication…

    Friday, January 17th, 2014

    Rule-based deduplication of article records from bibliographic databases by Yu Jiang, et al.


    We recently designed and deployed a metasearch engine, Metta, that sends queries and retrieves search results from five leading biomedical databases: PubMed, EMBASE, CINAHL, PsycINFO and the Cochrane Central Register of Controlled Trials. Because many articles are indexed in more than one of these databases, it is desirable to deduplicate the retrieved article records. This is not a trivial problem because data fields contain a lot of missing and erroneous entries, and because certain types of information are recorded differently (and inconsistently) in the different databases. The present report describes our rule-based method for deduplicating article records across databases and includes an open-source script module that can be deployed freely. Metta was designed to satisfy the particular needs of people who are writing systematic reviews in evidence-based medicine. These users want the highest possible recall in retrieval, so it is important to err on the side of not deduplicating any records that refer to distinct articles, and it is important to perform deduplication online in real time. Our deduplication module is designed with these constraints in mind. Articles that share the same publication year are compared sequentially on parameters including PubMed ID number, digital object identifier, journal name, article title and author list, using text approximation techniques. In a review of Metta searches carried out by public users, we found that the deduplication module was more effective at identifying duplicates than EndNote without making any erroneous assignments.

    I found this report encouraging, particularly when read alongside Rule-based Information Extraction is Dead!…, with regard to merging rules authored by human editors.

    Both reports indicate a pressing need for more complex rules than matching a URI for purposes of deduplication (merging in topic maps terminology).

    I assume such rules would need to be easier for the average user to declare than TMCL.
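
    To make "more complex rules than matching a URI" concrete, here is a minimal sketch in Python. It is not the Metta module. The field names, the thresholds, and the sample records are all made up for illustration, and difflib stands in for whatever text approximation the authors actually use. It follows the outline in the abstract: compare only records from the same year, trust shared identifiers first, then fall back to approximate matches on journal, title, and first author, keeping both records whenever the rules are unsure.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and reduce every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", (text or "").lower()).strip()


def similar(a, b, threshold=0.9):
    """Approximate match on normalized strings; empty fields never match."""
    a, b = normalize(a), normalize(b)
    if not a or not b:
        return False
    return SequenceMatcher(None, a, b).ratio() >= threshold


def first_author(record):
    """Return the first author string, or "" if none is recorded."""
    authors = record.get("authors") or [""]
    return authors[0]


def is_duplicate(r1, r2):
    """Apply the rules in order; when in doubt, keep both records."""
    if r1.get("year") != r2.get("year"):
        return False
    # Rule 1: an identifier present in both records settles the question.
    for key in ("pmid", "doi"):
        v1, v2 = r1.get(key), r2.get(key)
        if v1 and v2:
            return v1.strip().lower() == v2.strip().lower()
    # Rule 2: otherwise journal, title and first author must all agree.
    return (similar(r1.get("journal"), r2.get("journal"), 0.8)
            and similar(r1.get("title"), r2.get("title"))
            and similar(first_author(r1), first_author(r2), 0.8))


def deduplicate(records):
    """Keep the first record of each duplicate group, in input order."""
    kept = []
    for record in records:
        if not any(is_duplicate(record, other) for other in kept):
            kept.append(record)
    return kept


# Made-up records, as if returned by different databases.
a = {"year": 2013, "pmid": "123", "journal": "Journal of Examples",
     "title": "Sleep and memory in older adults", "authors": ["Jiang Y"]}
b = {"year": 2013, "doi": "10.1000/x", "journal": "Journal of Examples",
     "title": "Sleep and Memory in Older Adults.", "authors": ["Jiang, Y"]}
c = dict(a, pmid="999")                         # same text, different PMID
d = {k: v for k, v in a.items() if k != "pmid"}
d["year"] = 2012                                # same text, different year

merged = deduplicate([a, b, c, d])
```

    Only b merges into a. The rules leave c and d alone, which is the "err on the side of not deduplicating" constraint in action.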

    Light Table is open source

    Thursday, January 9th, 2014

    Light Table is open source by Chris Granger.

    From the post:

    Today Light Table is taking a huge step forward – every bit of its code is now on Github and along side of that, we’re releasing Light Table 0.6.0, which includes all the infrastructure to write and use plugins. If you haven’t been following the 0.5.* releases, this latest update also brings a tremendous amount of stability, performance, and clean up to the party. All of this together means that Light Table is now the open source developer tool platform that we’ve been working towards. Go download it and if you’re new give our tutorial a shot!

    If you aren’t already familiar with Light Table, check out The IDE as a value, also by Chris Granger.

    Just a mention in the notes, but start listening for “contextuality.” It comes up in functional approaches to graph algorithms.

    Wine Descriptions and What They Mean

    Saturday, December 14th, 2013

    Wine Descriptions and What They Mean

    [graphic omitted: wine chart]

    At $22.80 for two (2), you need one of these for your kitchen and another for the office.

    Complex information doesn’t have to be displayed in a confusing manner.

    This chart is evidence of that proposition.

    BTW, the original site (see above) is interactive, zooms, etc.

    Useful Unix/Linux One-Liners for Bioinformatics

    Tuesday, October 29th, 2013

    Useful Unix/Linux One-Liners for Bioinformatics by Stephen Turner.

    From the post:

    Much of the work that bioinformaticians do is munging and wrangling around massive amounts of text. While there are some “standardized” file formats (FASTQ, SAM, VCF, etc.) and some tools for manipulating them (fastx toolkit, samtools, vcftools, etc.), there are still times where knowing a little bit of Unix/Linux is extremely helpful, namely awk, sed, cut, grep, GNU parallel, and others.

    This is by no means an exhaustive catalog, but I’ve put together a short list of examples using various Unix/Linux utilities for text manipulation, from the very basic (e.g., sum a column) to the very advanced (munge a FASTQ file and print the total number of reads, total number unique reads, percentage of unique reads, most abundant sequence, and its frequency). Most of these examples (with the exception of the SeqTK examples) use built-in utilities installed on nearly every Linux system. These examples are a combination of tactics I used everyday and examples culled from other sources listed at the top of the page.

    What one-liners do you have lying about?

    For what data sets?
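
    For those who keep their one-liners in Python rather than awk, here is a rough sketch of the "very advanced" FASTQ example mentioned above. It is my own illustration, not Stephen's one-liner. It assumes plain four-line FASTQ records (no wrapped sequences), and since "unique reads" can be read two ways, it reports both the number of distinct sequences and the number seen exactly once. The function and key names are made up.

```python
from collections import Counter


def fastq_sequences(lines):
    """Yield the sequence line (2nd of every 4) from FASTQ text lines."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def read_stats(lines):
    """Summarize reads: totals, uniqueness, and the most abundant sequence."""
    counts = Counter(fastq_sequences(lines))
    total = sum(counts.values())
    if total == 0:
        return None
    top_seq, top_count = counts.most_common(1)[0]
    return {
        "total": total,
        "distinct": len(counts),                               # different sequences
        "singletons": sum(1 for n in counts.values() if n == 1),  # seen exactly once
        "pct_distinct": 100.0 * len(counts) / total,
        "most_abundant": top_seq,
        "most_abundant_count": top_count,
        "most_abundant_pct": 100.0 * top_count / total,
    }


# A tiny made-up FASTQ file held in memory: four reads, one repeated.
sample = [
    "@r1", "ACGT", "+", "IIII",
    "@r2", "ACGT", "+", "IIII",
    "@r3", "TTTT", "+", "IIII",
    "@r4", "GGCC", "+", "IIII",
]
stats = read_stats(sample)
```

    An open file handle can be passed in place of the list, since the function only iterates over lines.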

    Developing a Solr Plugin

    Saturday, April 27th, 2013

    Developing a Solr Plugin by Andrew Janowczyk.

    From the post:

    For our flagship product, we strive to bring the most cutting-edge technologies to our users. As we’ve mentioned in earlier blog posts, we rely heavily on Solr and Lucene to provide the framework for these functionalities. The nice thing about the Solr framework is that it allows for easy development of plugins which can greatly extend the capabilities of the software. We’ll be creating a set of slideshares which describe how to implement 3 types of plugins so that you can get ahead of the learning curve and start extending your own custom Solr installation now.

    There are mainly 4 types of custom plugins which can be created. We’ll discuss their differences here:

    Sometimes Andrew says three (3) types of plugins and sometimes he says four (4).

    I tried to settle the question by looking at the Solr Wiki on plugins.

    Depends on how you want to count separate plugins. 😉

    But, Andrew’s advice about learning to write plugins is sound. It will put your results above those of others.

    Operation Asymptote – [PlainSite / Aaron Swartz]

    Sunday, January 20th, 2013

    Operation Asymptote

    Operation Asymptote’s goal is to make U.S. federal court data freely available to everyone.

    The data is available now, but free only up to $15 worth every quarter.

    Serious legal research hits that limit pretty quickly.

    The project does not cost you any money, only some of your time.

    The result will be another source of data to hold the system accountable.

    So, how real is your commitment to doing something effective in memory of Aaron Swartz?

    Cancer, NLP & Kaiser Permanente Southern California (KPSC)

    Sunday, August 5th, 2012

    Kaiser Permanente Southern California (KPSC) deserves high marks for the research in:

    Identifying primary and recurrent cancers using a SAS-based natural language processing algorithm by Justin A Strauss, et al.


    Objective: Significant limitations exist in the timely and complete identification of primary and recurrent cancers for clinical and epidemiologic research. A SAS-based coding, extraction, and nomenclature tool (SCENT) was developed to address this problem.

    Materials and methods: SCENT employs hierarchical classification rules to identify and extract information from electronic pathology reports. Reports are analyzed and coded using a dictionary of clinical concepts and associated SNOMED codes. To assess the accuracy of SCENT, validation was conducted using manual review of pathology reports from a random sample of 400 breast and 400 prostate cancer patients diagnosed at Kaiser Permanente Southern California. Trained abstractors classified the malignancy status of each report.

    Results: Classifications of SCENT were highly concordant with those of abstractors, achieving κ of 0.96 and 0.95 in the breast and prostate cancer groups, respectively. SCENT identified 51 of 54 new primary and 60 of 61 recurrent cancer cases across both groups, with only three false positives in 792 true benign cases. Measures of sensitivity, specificity, positive predictive value, and negative predictive value exceeded 94% in both cancer groups.

    Discussion: Favorable validation results suggest that SCENT can be used to identify, extract, and code information from pathology report text. Consequently, SCENT has wide applicability in research and clinical care. Further assessment will be needed to validate performance with other clinical text sources, particularly those with greater linguistic variability.

    Conclusion: SCENT is proof of concept for SAS-based natural language processing applications that can be easily shared between institutions and used to support clinical and epidemiologic research.

    Before I forget:

    Data sharing statement: SCENT is freely available for non-commercial use and modification. Program source code and requisite support files may be downloaded from:

    Topic map promotion point: Application was built to account for linguistic variability, not to stamp it out.

    Tools built to fit users are more likely to succeed, don’t you think?

    Groves: The Past Is Close Behind

    Tuesday, July 3rd, 2012

    I was innocently looking for something else when I encountered:

    In HyTime ISO/IEC 10744:1997 “3. Definitions (3.35)”: graph representation of property values is ‘An abstract data structure consisting of a directed graph of nodes in which each node may be connected to other nodes by labeled arcs.’

    That sounds like a data structure that a property graph can represent quite easily.

    Does it sound that way to you?
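
    If it helps to see it, here is a minimal sketch in Python of that definition read as a property graph: nodes that carry property values, joined by directed, labeled arcs. The class, the method names, and the sample node and arc labels are my own inventions for illustration. Nothing here comes from the HyTime standard beyond the quoted sentence.

```python
from collections import defaultdict


class LabeledGraph:
    """Directed graph whose nodes hold properties and whose arcs carry labels."""

    def __init__(self):
        self.properties = {}             # node id -> dict of property values
        self.arcs = defaultdict(list)    # node id -> [(label, target id), ...]

    def add_node(self, node_id, **properties):
        """Create the node if needed and merge in any property values."""
        self.properties.setdefault(node_id, {}).update(properties)

    def add_arc(self, source, label, target):
        """Connect source to target with a labeled, directed arc."""
        self.add_node(source)
        self.add_node(target)
        self.arcs[source].append((label, target))

    def follow(self, source, label):
        """Return the targets reached from source over arcs with this label."""
        return [t for arc_label, t in self.arcs.get(source, []) if arc_label == label]


# Hypothetical example: a document node with two content nodes.
g = LabeledGraph()
g.add_node("doc", kind="document")
g.add_node("para1", kind="element", name="p")
g.add_arc("doc", "content", "para1")
g.add_arc("doc", "content", "para2")
g.add_arc("para1", "parent", "doc")

children = g.follow("doc", "content")
```

    Arcs are stored per source node in insertion order, so ordered content (which markup people tend to care about) survives the round trip.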

    Regrets – June 18-19, 2012

    Monday, June 18th, 2012

    Apologies but I will not be making technical posts to Another Word For It on June 18th or June 19th, 2012.

    Medical testing that was supposed to end mid-day on Monday has spread over onto Tuesday. And most of Tuesday at that.

    I don’t want to post unless I think the information is useful and/or I have something useful to say about the information. I’m ok but can’t focus enough to promise either one.

    On the “bright” side, I hope to return to posting on Wednesday (June 20, 2012) and am only a few posts away from #5,000!

    I appreciate well wishes but be aware that I won’t be answering emails during this time period either. I stole a few minutes to make this post.

    Infinite Weft (Exploring the Old Aesthetic)

    Tuesday, April 10th, 2012

    Infinite Weft (Exploring the Old Aesthetic)

    Jer Thorp writes:

    How can a textile function as a digital object? This is a central question of Infinite Weft, a project that I’ve been working on for a the last few months. The project is a collaboration with my mother, Diane Thorp, who has been weaving for almost 40 years – it’s a chance for me to combine my usually screen-based digital practice with her extraordinary hand-woven work. It’s also an exploration of mathematics, computational history, and the concept of pattern.

    Most of us probably know that the loom played a part in the early days of computing – the Jacquard loom was the first machine to use punch cards, and its workings were very influential in the early design of programmable machines (In my 1980s basement this history was actually physically embodied; sitting about 10 feet away from my mother’s two floor looms, on an Ikea bookself, sat a box of IBM punch cards that we mostly used to make paper airplanes out of). But how many of us know how a loom actually works? Though I have watched my mother weave many times, it didn’t take long at the start of this project to realize that I had no real idea how the binary weaving patterns called ‘drawdowns‘ ended up making a pattern in a textile.

    [graphic omitted]

    To teach myself how this process actually happened, I built a functional software loom, where I could see the pattern manifest itself in the warp and weft (if you have Chrome you can see it in action here – better documentation is coming soon). This gave me a kind of sandbox which let me see how typical weaving patterns were constructed, and what kind of problems I could expect when I started to write my own. And run into problems, I did. My first attempts at generating patterns were sloppy and boring (at best) and the generative methods I was applying weren’t very successful. Enter Ralph E. Griswold.

    By this point, “concept of pattern,” “punch cards,” “software loom,” and “Ralph E. Griswold,” I was completely hooked.
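
    For anyone else who has watched weaving without knowing how a drawdown comes about, here is a small sketch in Python. It is the textbook model as I understand it, not Jer Thorp's software loom. I am assuming a rising-shed loom described by three things: a threading (which shaft each warp thread passes through), a tie-up (which shafts each treadle lifts), and a treadling (which treadle is pressed for each pick of weft). A cell of the drawdown is "warp up" when the thread's shaft is lifted for that pick. All names are my own.

```python
def drawdown(threading, tie_up, treadling):
    """Return one row per pick; True where the warp thread is raised."""
    return [
        [shaft in tie_up[treadle] for shaft in threading]
        for treadle in treadling
    ]


def render(rows):
    """Draw raised warp threads as '#' and lowered ones as '.'."""
    return ["".join("#" if up else "." for up in row) for row in rows]


# Plain weave on two shafts: alternate threads, alternate treadles.
threading = [1, 2, 1, 2]          # shaft for each warp thread, left to right
tie_up = {1: {1}, 2: {2}}         # treadle -> set of shafts it lifts
treadling = [1, 2, 1, 2]          # treadle pressed for each pick

pattern = render(drawdown(threading, tie_up, treadling))
```

    For this plain weave the result is the familiar checkerboard: "#.#." alternating with ".#.#". Change the threading or the tie-up and the binary pattern changes with it.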


    Would You Know “Good” XML If It Bit You?

    Tuesday, February 14th, 2012

    XML is a pale imitation of a markup language. It has resulted in real horrors across the markup landscape. After years in its service, I don’t have much hope of that changing.

    But, the Princess of the Northern Marches has organized a war council to consider how to stem the tide of bad XML. Despite my personal misgivings, I wish them well and invite you to participate as you see fit.

    Oh, and I found this message about the council meeting:

    International Symposium on Quality Assurance and Quality Control in XML

    Monday August 6, 2012
    Hotel Europa, Montréal, Canada

    Paper submissions due April 20, 2012.

    A one-day discussion of issues relating to Quality Control and Quality Assurance in the XML environment.

    XML systems and software are complex and constantly changing. XML documents are highly varied, may be large or small, and often have complex life-cycles. In this challenging environment quality is difficult to define, measure, or control, yet the justifications for using XML often include promises or implications relating to quality.

    We invite papers on all aspects of quality with respect to XML systems, including but not limited to:

    • Defining, measuring, testing, improving, and documenting quality
    • Quality in documents, document models, software, transformations, or queries
    • Case studies in the control of quality in an XML environment
    • Theoretical or practical approaches to measuring quality in XML
    • Does the presence of XML, XML schemas, and XML tools make quality checking easier, harder, or even different from other computing environments
    • Should XML transforms and schemas be QAed as software? Or configuration files? Or documents? Does it matter?

    Paper submissions due April 20, 2012.

    Details at:

    You do have to understand the semantics of even imitation markup languages before mapping them with more robust languages. Enjoy!

    Ambiguity in the Cloud

    Thursday, December 15th, 2011

    If you are interested at all in cloud computing and its adoption, you need to read US Government Cloud Computing Technology Roadmap Volume I Release 1.0 (Draft). I know, a title like that is hardly inviting. But read it anyway. Part of a three volume set, for the other volumes see: NIST Cloud Computing Program.

    Would you care to wager on how many of the ten (10) requirements cite a need for interoperability that is presently lacking due to different understandings and terminology, in other words, ambiguity?

    Good decision.

    The answer? 8 out of 10 requirements cited by NIST have interoperability as a component.

    The plan from NIST is to develop a common model, which will be a useful exercise, but how do we discuss differing terminologies until we can arrive at a common one?

    Or allow for discussion of previous SLAs, for example, after we have all moved on to a new terminology?

    If you are looking for a “hot” topic that could benefit from the application of topic maps (as opposed to choir programs at your local church during the Great Depression) this could be the one. One of those is a demonstration of a commercial grade technology, the other is at best a local access channel offering. You pick which is which.