Plagiarism detection is a form of detecting subject-sameness.
If you think of a document as a subject and say 95% of it is the same as another document, you could conclude that it is the same subject. (Or set your own level of duplication for subject-sameness.)
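A minimal sketch of that kind of threshold test, in Python. The measure here (Jaccard similarity over three-word shingles) is my own stand-in for "95% the same", not something taken from any particular plagiarism tool, and the names `shingles`, `similarity`, and `same_subject` are hypothetical.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts become a single shingle; empty texts have none.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(doc_a, doc_b, k=3):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    a, b = shingles(doc_a, k), shingles(doc_b, k)
    if not a and not b:
        return 1.0  # assumption: two empty documents count as identical
    return len(a & b) / len(a | b)


def same_subject(doc_a, doc_b, threshold=0.95, k=3):
    """Treat two documents as the same subject at or above the threshold."""
    return similarity(doc_a, doc_b, k) >= threshold
```

Lowering `threshold` is the "set your own level of duplication" knob. A different measure (say, how much of one document is contained in the other) would slot into `similarity` without changing the rest.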
One of the early use cases for topic maps was avoiding the duplication of documentation (and billing for the same) for defense systems.
Detecting self-plagiarism from a law firm, vendor, contractor, consultant is one thing.
Putting those incidents together across a government agency, business, institution, or enterprise is a job for topic maps.
The SciDB project illustrates that there is no general case solution for semantic identity.
If we distinguish between IRIs as addresses versus IRIs as identifiers, IRIs are useful for some cases of semantic identity. (IRIs can be used even if you don’t make that distinction, but they are less useful.)
But can you imagine an IRI for each tuple of values in the roughly 15 petabytes of data produced annually by the Large Hadron Collider? It may be very important to identify any number of those tuples, such as if (not when) they discover the Higgs boson.
Those tuples have semantic identity, as do subjects composed of those tuples.
Rather than seeking general solutions for all semantic identity, perhaps we should find solutions that work for particular cases.
I haven’t looked at the documents but document collections present the same issues for effective use.
First, document semantics vary depending upon whether they are being read by their intended audience, another military command or other audience. For example, locations may be identified by unfamiliar terms.
Second, and nearly as important, what if one analyst bridges the different semantics and identifies a location? How do they map it to their own semantics and communicate that fact to others?
Could pass around a sticky note. Put it on a blackboard. Write it up in a multi-page report.
Topic maps are an effective means to navigate data and multiple interpretations of it, not to mention integrating other data you may have on hand.
Topic maps don’t constrain in advance what subjects you can identify or the basis on which you identify them, and they let you quickly share discoveries with others.
Wikileaks can be annoying. Topic maps can make Wikileaks effective. There’s a difference.
Short Title
section 1. This Act may be cited as the “______Act of____”.
Ask yourself: How would topic maps lead to a different result? (Ok, that probably wasn’t your first thought, work with me here.)
If bills were treated as subjects, represented by topics, we could use TMCL to specify that every topic of type “House Bill” has to have one and only one name.
houseBill isa tmcl:topic-type;
has-name(tmdm:topic-name, 1, 1).
Which says every topic of House Bill type has one and only one name. And we should get an error warning if it is missing.
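For readers who don't have a TMCL validator at hand, here is a rough Python sketch of what such a check amounts to. It is not TMCL and not any topic map engine's API; the `Topic` class, the `NAME_CONSTRAINTS` table, and `name_errors` are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    topic_id: str
    topic_type: str
    names: list = field(default_factory=list)


# Hypothetical constraint table: topic type -> (min names, max names).
# The (1, 1) entry mirrors has-name(tmdm:topic-name, 1, 1) above.
NAME_CONSTRAINTS = {"houseBill": (1, 1)}


def name_errors(topics, constraints=NAME_CONSTRAINTS):
    """Return one message per topic whose name count breaks its constraint."""
    errors = []
    for topic in topics:
        if topic.topic_type not in constraints:
            continue  # no constraint declared for this type
        low, high = constraints[topic.topic_type]
        count = len(topic.names)
        if not low <= count <= high:
            errors.append(
                f"{topic.topic_id}: expected {low}..{high} names, found {count}"
            )
    return errors
```

A bill topic with no name at all shows up in the returned list, which is the "error warning" the proofing step was missing.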
If that seems like a lot of trouble to fix a workflow proofing glitch, consider this:
U.S. legislation typically runs hundreds, even thousands of pages with provisions that are relevant to particular constituencies. What if all those provisions and their constituencies were treated as subjects, represented by topics?
Everyone could read those provisions of interest to them or the ones they were interested in opposing (possibly the more popular of the two). Instead of 2,000 pages you might need to read only 3 to 5 pages.
Reading maybe 3 to 5 pages sounds more like transparency to me than dumping 2,000+ pages on my desk and calling it “transparency.”
******
PS: My suggestion to fix the bill title: “Last Opaque Act of 2010.” Whether lobbyists, elected officials and agencies can hear it or not, transparency is coming, to the USA.
The basic idea is that an organization should have one uniform way to talk about its non-transactional entities. In topic map land we would say subjects.
OK, but here comes the payoff question: How does the organization deal with heterogeneous data from others?
Ah, yes, well, hmmm, …..that wasn’t part of our MDM contract.
You can be an island of pure data (ghetto?) in a heterogeneous world (MDM) or you can play well with others (topic maps). Which do you think offers the most commercial advantage?
To answer it I plan on blogging about an opportunity for the use of topic maps every week. Maybe a project, a software package, etc., but in all cases, an instance where topic maps would make a positive difference. Suggestions about opportunities that I should blog about are most welcome.
Watch this blog for my first “opportunity for topic maps” posting on 26 July 2010. The project in question is spending $millions on a non-topic map mapping solution and has been for years.
It’s early in the year for predictions but I think this is going to be my topic maps poster-child story for 2010.
I don’t doubt that with enough effort, a topic map could be perverted to reflect the lack of sharing and coordination that is reported in this story. But if the President were to assert real control, topic maps could be a part of the solution. (My suggestion would be no sharing = no paycheck/funding. These “patriots” won’t report for work without paychecks. “Pocketbook patriotism.”)
This story illustrates the need for topic maps in three ways:
First, they could help the Washington Post offer a drill down to the actual sources and public contract information that underlies their story. Not to mention knowing which representatives got donations from the same contractors who now have contracts for national security? Can you say “merging?”
Second, rather obviously, topic maps could help eliminate the extreme duplication of information flow, which would allow analysts to concentrate on less, but higher quality, information. And eliminating the duplicate information flow should also trim down the middle and upper level management staffs, which would increase the amount of funding that could be spent on effective intelligence activities.
Third, and perhaps less obviously, intelligence operations of other governments and governments in waiting should take a lesson in how to not run an effective intelligence operation. If you don’t have $Billions to waste on duplicated and fragmented intelligence operations, perhaps you should consider the advantages that topic maps can bring to an intelligence operation.
Those advantages vary depending on what you want, but typically they include elimination of duplicated content, enhanced sharing between intelligence agencies, tracking of information flow, integration of data from outside sources, and multiple views of the data or multi-lingual presentation.
Those advantages are not automatic. No IT system, not even topic maps, can solve personnel management issues, greed, corruption, inter-agency rivalry, sheer stupidity, etc., but assuming you can manage those, topic maps can help make intelligence operations more effective.
A reply I got to suggesting asking users about their needs:
I have never heard of an inventor making surveys to test things out. That is nonsense. At most what that can tell you is little details, ways to fine tune a system. It will never let you see the big changes coming.
The average user has at least as much imagination as would-be tyrants of the WWW have arrogance, if not more.
I am going to ignore that advice and think you should as well.
The conventional wisdom that what evolved was Algol vs. Fortran is deeply questionable.
The underlying difficulty, a familiar one in semantic integration circles, was a universal programming language versus a diversity of programming languages.
Can you guess who won?
Can you guess where I would put my money in a repeat of a universal solution vs. diverse solutions?
For publishers, it would be possible to map responses on the basis of topics and let the topic map handle the details of where that is the appropriate response to an “opposing” app. It should shorten the update/production cycle as new material is added to counter new arguments or variations of old ones.
On the product side, publishers could use topic maps to enable users to respond to a variety of ways of naming or phrasing particular issues. In debates over religion, as in all other areas, differences in terminology can make it difficult to come to grips with the opposing side.
Depending on how it was implemented, a topic map app could integrate other resources, ranging from study materials to personal contacts, as they relate to this application. Think of a topic map as being able to bridge between data held in mini-silos on an iPhone. So users could add information to the app that was useful to them in such debates.
Any other critical points I should make as I contact publishers of these apps to recommend topic maps?
*****
PS: Did anyone with an iPhone try out tmjs from Jan Schreiber? I really don’t want to have to buy an iPhone just for that. Help me out here.
But it is only one step. True, it has reduced creation of topic maps to a drop down menu for DBpedia and Wikipedia resources, but still falls short of offering users a full-featured topic map experience.
There are a number of topic map engines, bare topic map engines. If all the reported 8.5 million developers in the world started playing with those engines tomorrow, that is less than 1 percent of the 1 billion computer users in the world. My marketing department (my wife) thinks targeting promotional efforts at less than 1 percent of the potential audience is crazy (a technical marketing term for not good judgment).
The Mappify web service is an enormous step in the right direction.
But the honey we need for users is a demonstration of the immediate payoff, without any effort on their part, from this thing we call topic maps.
What to do once we have “caught” them is open to your imagination and ingenuity.
The Wikipedia article on unstructured data makes it clear that data may have a structure, but that “unstructured data” means one not readily recognizable to a computer.
The term unstructured data bothers me because any text has a structure. If it didn’t, we would not be able to read it. It would just be a jumble of symbols. Oh, sorry. Apologies to any AI agents “reading” this post. But that is how traditional computers see a text, just a jumble of symbols.
When people view a text, they see structure, recognize subjects, etc. Moreover, different people can look at the same text and see different structures and/or subjects.
There are topic maps that are written to enforce a “correct” view of a body of data and those are certainly useful in many cases. Topic maps also support users identifying the structures and subjects they see in a text, alongside identifications made by others.
To the extent that users view texts and leave trails, as it were, of the structures and subjects they identified in a text (or body of texts), those trails form maps that can be useful to others.
Think of it as tagging but with explicit subject identity. The relationships to a particular text, its author, and a variety of other details could be extracted automatically and with a minimum of effort on the part of the user. A topic map application could even suggest subjects or associations for a user to confirm based on their reading.
Suggest: unmapped data.
Captures both the sense of exploration as well as allowing for multiple mappings.
Thoughts?
Many tasks in library and information science (e.g., indexing, abstracting, classification, and text analysis techniques such as discourse and content analysis) require text meaning interpretation, and, therefore, any individual differences in interpretation are relevant and should be considered, especially for applications in which these tasks are done automatically. This article investigates individual differences in the interpretation of one aspect of text meaning that is commonly used in such automatic applications: lexical cohesion and lexical semantic relations. Experiments with 26 participants indicate an approximately 40% difference in interpretation. In total, 79, 83, and 89 lexical chains (groups of semantically related words) were analyzed in 3 texts, respectively. A major implication of this result is the possibility of modeling individual differences for individual users. Further research is suggested for different types of texts and readers than those used here, as well as similar research for different aspects of text meaning.
I won’t belabor what a 40% difference in interpretation implies for the one-interpretation-of-data crowd. At least for those who prefer an evidence over ideology approach to IR.
What is worth belaboring is how to use Morris’ technique to demonstrate such differences in interpretation to potential topic map customers. As a community we could develop texts for use with particular market segments, business, government, legal, finance, etc. An interface to replace the colored pencils used to mark all words belonging to a particular group. Automating some of the calculations and other operations on the resulting data.
Sensing that interpretations of texts vary is one thing. Having an actual demonstration, possibly using texts from a potential client, is quite another.
This is a tool we should build. I am willing to help. Who else is interested?
A print index does not organize all the information about a subject in one location. It doesn’t even organize all the information in your personal book collection about a subject in one location. It organizes all the information in one book about a subject in one location.
We are no longer subject to that constraint.
But the question is: Without any artificial barriers, what information should go with a subject?
Example: Online maps co-locate information about hotels, convenience stores, bars, etc. with physical locations.
That is a tiny number of the subjects that we see or read about in a week. What would you like to see with those subjects?
Exercise: Every day for the next two weeks, take pencil/pen and paper around with you. At least once per day, twice if you can manage it, write down a subject you want to know more about. Without stopping to think about difficulty, expense, etc., jot down 5 pieces of information you would like to see with that subject.
Extra credit: For extra credit, rank in what order you would like to see the additional information.
LibraryThing is the home of OverCat, a collection of 32 million library records.
It is a nifty illustration of re-using identifiers, not re-inventing them.
I put in an ISBN, for example, and the system searches for that work. It does not ask me to create a “cool” URI for it.
It also demonstrates some of the characteristics of a topic map in that it does not return multiple matches for all the libraries that hold a work, but only one. (You can still view the other records as well.)
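A minimal Python sketch of that behavior: records that share an identifier collapse into one entry per work, while the contributing libraries stay visible. This is my illustration, not how LibraryThing or OverCat actually works; `merge_by_isbn` and the record fields (`isbn`, `title`, `library`) are assumptions.

```python
from collections import defaultdict


def merge_by_isbn(records):
    """Group library records that share an ISBN into one entry per work."""
    merged = defaultdict(lambda: {"titles": set(), "libraries": set()})
    for record in records:
        # Normalize so hyphenated and plain forms of an ISBN match.
        isbn = record["isbn"].replace("-", "").strip()
        merged[isbn]["titles"].add(record["title"])
        merged[isbn]["libraries"].add(record["library"])
    return dict(merged)
```

The ISBN does the work of a subject identifier here. Nobody had to agree on a new "cool" URI before their records could be merged.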
I am not sure I have the time to enter, even by ISBN, all the books that line the walls of my office but maybe I will start with the new ones as they come in and the older ones as I use them. The result is a catalog of my books, but more importantly, additional information about those works entered by others.
Maybe that could be a marketing pitch for topic maps? That topic maps enable users to coordinate their information with others, without prior agreement. Sort of like asking for a ride to town and at the same time, someone in a particular area says they are going to town but need to share gas expenses. (Treating a circumference around a set of geographic coordinates as a subject. Users neither know nor care about the details, just expressing their needs.)
Talend Reference Library offers collections of case studies and white papers to make the case for data integration.
I can’t say that I care for some of the solutions that are proffered but I am aware that having a hammer (topic maps) doesn’t mean everything I see is a nail. 😉
You do have to submit contact information to download the papers.
The papers are useful as guides on making the case for data integration (read topic maps) to management level personnel. Not too much on the technical side and always keeping a focus on issues of concern to them, costs, customer satisfaction, missed opportunities, etc.
Save the “cool” stuff for when you meet with the geeks in the IT department, after you have the contract.
Agencies uphold a “need-to-know” culture of information protection rather than promoting a “need-to-share” culture of integration. (page 417)
Fast forward seven years and we find:
[Information Sharing Environment – ISE] Gaps exist in….(3) determining the results to be achieved by the ISE (that is, how information sharing is improved) along with associated milestones, performance measures, and the individual projects. (Information Sharing [2008])
Seven years later and there are gaps in “how information sharing is improved…..”?
The power that comes from not sharing knowledge is strong enough to maintain data silos even in the face of national peril.
Topic maps can help you breach any silo you can access. Make that access meaningful and effective.
Not just national security data silos. Take mapping data silos of a regulated industry, say financial institutions. A mapping that grows with every audit/investigation.
Your choices are: 1) Wait for someone to relinquish power, or 2) Increase your power by breaching their data silo. Which one is for you?
I no sooner point out that the Balisage conference lacks topic maps papers than a challenge lands in my inbox.
A challenge I could not tailor more for topic maps.
Coincidence? You decide.
As part of the Balisage 2010 Conference, MarkLogic has put forth a challenge in the form of a contest. The goal of the contest is to encourage markup experts to review and to research the current state of wiki markup languages and to generate a proposal that serves to de-babelize the current state of affairs for the long haul.
Wikis: tower-of-babel Solve the modern tower of babel
Contest Description: In the past few decades, as a planet, we’ve succeeded tremendously in standardizing a number of technologies (yay us!). Wiki technology (other than its underlying use of web technologies as a platform) is not solidly in this list. There is a lot of content available today in a variety of wiki syntaces. This syntax is not standardized. Some argue it shouldn’t be. Go beyond the existing debates, diatribes, and arguments. Put us on a practical track to fixing this and ensuring we will have access to this content for the long haul.
To enter, you must propose a set of concrete steps (organizational, social, and/or technological) that will enable wiki content interchange, a real WYSIWYG editor, and/or wiki syntax standardization.
Entries will be evaluated based on criteria that includes:
* How well does the entry understand the current state of the art?
* How well does the entry identify key stake holders and actors
(including history, motivation, and so on)
* Is the entry clear on its objectives? (The summary allows for
some variance here).
* Is the approach/vision elegant, clever, or mind-changing?
* Are the set of steps actionable and implementable?
Guidelines, rules, and prize:
1. Please no more than 10000 words.
2. Entries should be submitted by July 15th to:
balisage-2010-contest at marklogic dot com
3. Author(s) retains copyright and grants MarkLogic a non-exclusive
license to publish the winning entry.
4. The winner will be announced on August 3rd at the conference and
will take home a choice of
* Apple 15″ (i5) MacBook Pro
* Apple MacBook Air or
* USD $2000
5. The winner will be strongly encouraged (but not required) to give a
brief summary (~10 minutes) of their winning entry at the conference
on August 3rd.
6. Employees of MarkLogic are not eligible.
7. Judges decision is final.
8. Contest-related questions may also be submitted to:
balisage-2010-contest at marklogic dot com.
Are you ready to take the challenge?
Opportunities for topic maps as stand alone information products.
The Kobo eReader has 1 GB of storage standard and holds up to 1,000 titles. Topic maps could serve either for content navigation in general or for particular books. A topic map of Jane Austen’s “Pride and Prejudice” might excite one of my college English professors, but I don’t think it would be a real “hot” number in terms of sales. (Austen’s work is the default on the advertising I get at Borders. Your display may be different.) For further information, see the Kobo Developer Program.
Kindle (Amazon product) is another option. I would put in a link to their developer resources but all the strings have tracking information embedded in them. Just go to Amazon and follow the links to the Kindle resources. (A simple link to developer resources would be nice, just in case you know someone at Amazon.)
Or Lulu, a traditional print-on-demand/ebook publisher, has released LuLu for Developers. The LuLu company profile points out that in 2008, there were 276,489 books traditionally published in the United States. LuLu alone published 400,000 titles last year. Perhaps not every title merits a topic map but what if you created a topic map for a group of titles? That would promote sales of the titles as a group and be a value add to users.
Suppose I should also mention iPad Apps. Since I don’t have a cell phone, much less an iPhone, this one would be a really steep learning curve for me. Please post pointers to anyone developing topic maps for the iPad.
I haven’t tried one of these eformats with topic maps (yet) but suspect that once a book is “in” any of the formats, reliable pointing into them will be possible.
Imagine the “truth squads” who would want to sell their “version” alongside popular books. And then responses, using your topic map to reply to the first response.