Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

September 19, 2011

Is Semantic Reuse A Requirement?

Filed under: Marketing — Patrick Durusau @ 7:52 pm

I was blogging about improvements at GeoCommons when I thought about my usual complaint about mapping services: there is no possibility of semantic reuse. Whatever the user thought they recognized as the basis for a mapping is simply not recorded.

That is true for ETL apps like Talend, Kettle and others. They all handle mappings/transformations, but none of them record the identifications of the subjects the user recognized. That information is simply not present. Any reuse requires another user to make an implicit mapping of subjects and their identifications, and to fail to record them all over again.

Is it the case that reuse of semantic mappings may not be a requirement?

That is, are mappings created as one-use mappings, so that whenever anything changes the entire mapping has to be inspected, if not re-created?
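To make the complaint concrete, here is a minimal sketch (in Python, with hypothetical field names and a hypothetical subject identifier URI) of what an ETL mapping record could look like if it captured the identification of the subject along with the source and target columns:

```python
from dataclasses import dataclass, field

@dataclass
class FieldMapping:
    """One column-to-column mapping, plus the identification of the
    subject the mapper believed both columns refer to."""
    source: str                      # e.g. "legacy.cust_zip" (hypothetical)
    target: str                      # e.g. "warehouse.postal_code" (hypothetical)
    subject_identifiers: list = field(default_factory=list)
    note: str = ""

# The subject identifier is what makes the mapping reusable: a later
# mapper can match on it instead of guessing from column names.
m = FieldMapping(
    source="legacy.cust_zip",
    target="warehouse.postal_code",
    subject_identifiers=["http://example.org/subject/postal-code"],
    note="US 5-digit ZIP only; extended ZIP+4 stored elsewhere",
)
```

Nothing exotic: the point is simply that the mapper's identification of the subject travels with the mapping, so the next user can reuse it instead of re-discovering it.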

Wandora – New Release

Filed under: Topic Map Software,Wandora — Patrick Durusau @ 7:52 pm

Wandora – New Release

New Features:

  • Fixes GeoNames extractors. Wandora’s GeoNames extractors now require a username provided by GeoNames.
  • R console window has been rewritten.
  • The topic table now allows any topic selection. Java’s JTable component allows selections of single rows and single columns only (which is fine if you have only one column). Wandora now overcomes that default limitation.
  • Occurrences can be duplicated (to another type). The user can also change an occurrence’s type.
  • We have tested Wandora on Java 7.
  • New feature: Export similarity matrix. The similarity matrix is similar to a topic adjacency matrix, but each cell contains a value representing the selected similarity of the row and column topics. The feature has predefined similarity measures such as subject locator similarity, highest subject identifier similarity, basename similarity, etc.
  • Wandora’s Firefox and Thunderbird plugins now work on Firefox 6.0 and Thunderbird 6.0.

TCP Text Creation Partnership

Filed under: Concept Drift,Dataset,Language — Patrick Durusau @ 7:51 pm

TCP Text Creation Partnership

From the “mission” page:

The Text Creation Partnership’s primary objective is to produce standardized, digitally-encoded editions of early print books. This process involves a labor-intensive combination of manual keyboard entry (from digital images of the books’ original pages), the addition of digital markup (conforming to guidelines set by a text encoding standard-setting body known as the TEI), and editorial review.

The chief sources of the TCP’s digital images are database products marketed by commercial publishers. These include Proquest’s Early English Books Online (EEBO), Gale’s Eighteenth Century Collections Online (ECCO), and Readex’s Evans Early American Imprints. Idiosyncrasies in early modern typography make these collections very difficult to convert into searchable, machine-readable text using common scanning techniques (i.e., Optical Character Recognition). Through the TCP, commercial publishers and over 150 different libraries have come together to fund the conversion of these cultural heritage materials into enduring, digitally dynamic editions.

To date, the EEBO-TCP project has converted over 25,000 books. ECCO- and EVANS-TCP have converted another 7,000+ books. A second phase of EEBO-TCP production aims to convert the remaining 44,000 unique monograph titles in the EEBO corpus by 2015, and all of the TCP texts are scheduled to enter the public domain by 2020.

Several thousand titles from the 18th century collection are already available to the general public.

I mention this as a source of texts for testing search software against semantic drift, the sort of drift that occurs in any living language. To say nothing of the changing mores of our interpretation of languages with no native speakers remaining to defend them.

DiscoverText

Filed under: Data Mining,Text Analytics — Patrick Durusau @ 7:51 pm

DiscoverText

From the webpage:

DiscoverText helps you gain valuable insight about customers, products, employees, citizens, research data, and more through powerful text analytic methods. DiscoverText combines search, human judgments and inferences with automated software algorithms to create an active machine-learning loop.

DiscoverText is currently used for text analytics, market research, eDiscovery, FOIA processing, employee engagement analytics, health informatics, processing public comments by government agencies and university basic research.

Before I sign up for the free trial version, do you have any experience with this product? Suggested data sets that make it shine or not shine so much?

Introducing CorporateGroupings

Filed under: Data,Dataset,Fuzzy Matching — Patrick Durusau @ 7:51 pm

Introducing CorporateGroupings: where fuzzy concepts meet legal entities

From the webpage:

One of the key issues when you’re looking at any big company is what are the constituent parts – because these days a company of any size is pretty much never a single legal entity, but a web of companies, often spanning multiple jurisdictions.

Sometimes this is done because the company’s operations are in different territories, sometimes because the company is a conglomerate of different companies – an educational book publisher and a financial newspaper, for example. Sometimes it’s done to limit the company’s tax liability, or for other legal reasons (e.g. to benefit from a jurisdiction’s rules & regulations compared with the ‘parent’ company’s jurisdiction).

Whatever the reason, getting a handle on the constituent parts is pretty tricky, whether you’re a journalist, a campaigner, a government tax official or a competitor, and making it public is trickier still, meaning the same research is duplicated again and again. And while we may all want to ultimately surface in detail the complex cross-holdings of shareholdings between the different companies, that goal is some way off, not least because it’s not always possible to discover the shareholders of a company.

….

So you must make do with reading annual reports and trawling company registries around the world, and hoping you don’t miss any. We like to think OpenCorporates has already made this quite a bit easier, meaning that a single search for Tesco returns hundreds of results from around the world, not just those in the UK, or some other individual jurisdiction. But what about where the companies don’t include the group in the name, and how do you surface the information you’ve found for the rest of the world?

The solution to both, we think, is Corporate Groupings, a way of describing a grouping of companies without having to say exactly what legal form that relationship takes (it may be a subsidiary of a subsidiary, for example). In short, it’s what most humans (i.e. non tax-lawyers) think of when they think of a large company – whether it’s a HSBC, Halliburton or HP.

This could have legs.

Not to mention that what is a separate subject to you (a subsidiary) may be encompassed by a larger subject to me. Both are valid from a certain point of view.

September 18, 2011

Approaching optimality for solving SDD systems

Filed under: Algorithms,Mathematics,Matrix — Patrick Durusau @ 7:29 pm

In October 2010, the authors presented this paper:

Approaching optimality for solving SDD systems by Ioannis Koutis, Gary L. Miller, and Richard Peng.

Public reports on that paper can be found in A Breakthrough in Algorithm Design, in the September 2011 issue of CACM, and in PC Pro: Algorithm sees massive jump in complex number crunching.

The claim is that the new approach will be a billion times faster than traditional techniques.

In February 2011, the authors posted a new and improved version of their algorithm:

A nearly-mlogn time solver for SDD linear systems.

Koutis has written a MATLAB implementation at: CMG: Combinatorial Multigrid

For further background, see: Combinatorial Preconditioning, sparsification, local clustering, low-stretch trees, etc. by Spielman, one of the principal researchers in this area.

The most obvious application in topic maps would be recommender systems that bring possible merges to a topic map author’s attention or even perform merging on specified conditions. (If the application doesn’t seem obvious, read the post I refer to in Text Feature Extraction (tf-idf) – Part 1 again. It will also give you some ideas about scalable merging tests.)

Years ago Lars Marius told me that topic maps needed to scale on laptops to be successful. It looks like algorithms are catching up to meet his requirement.
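If you want to experiment before the new solvers reach mainstream libraries, here is a minimal sketch of solving an SDD system with an off-the-shelf conjugate gradient solver from SciPy. This is the traditional iterative approach the paper improves on, not the Koutis, Miller and Peng algorithm, and the path-graph Laplacian is just a toy example:

```python
import numpy as np
from scipy.sparse import csr_matrix, diags
from scipy.sparse.linalg import cg

# Laplacian of a 100-node path graph: L = D - A (symmetric, diagonally dominant).
n = 100
rows = list(range(n - 1)) + list(range(1, n))
cols = list(range(1, n)) + list(range(n - 1))
A = csr_matrix((np.ones(2 * (n - 1)), (rows, cols)), shape=(n, n))
L = diags(np.asarray(A.sum(axis=1)).ravel()) - A

# A pure Laplacian is singular, so add a small diagonal term;
# the system stays symmetric diagonally dominant (SDD).
M = L + 0.1 * diags(np.ones(n))

b = np.random.default_rng(0).standard_normal(n)
x, info = cg(M, b)                        # info == 0 means CG converged
print(info, np.linalg.norm(M @ x - b))    # residual should be tiny
```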

Text Feature Extraction (tf-idf) – Part 1

Filed under: Text Feature Extraction,Vector Space Model (VSM) — Patrick Durusau @ 7:29 pm

Text Feature Extraction (tf-idf) – Part 1 by Christian Perone.

To give you a taste of the post:

Short introduction to Vector Space Model (VSM)

In information retrieval or text mining, the term frequency – inverse document frequency also called tf-idf, is a well know method to evaluate how important is a word in a document. tf-idf are also a very interesting way to convert the textual representation of information into a Vector Space Model (VSM), or into sparse features, we’ll discuss more about it later, but first, let’s try to understand what is tf-idf and the VSM.

VSM has a very confusing past, see for example the paper The most influential paper Gerard Salton Never Wrote that explains the history behind the ghost cited paper which in fact never existed; in sum, VSM is an algebraic model representing textual information as a vector, the components of this vector could represent the importance of a term (tf–idf) or even the absence or presence (Bag of Words) of it in a document; it is important to note that the classical VSM proposed by Salton incorporates local and global parameters/information (in a sense that it uses both the isolated term being analyzed as well the entire collection of documents). VSM, interpreted in a lato sensu, is a space where text is represented as a vector of numbers instead of its original string textual representation; the VSM represents the features extracted from the document.

The link to The most influential paper Gerard Salton Never Wrote fails. Try the cached copy at CiteSeer: The most influential paper Gerard Salton Never Wrote.

Recommended reading.
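If you want to see the weighting with no library machinery at all, here is a minimal sketch of one common tf-idf variant in plain Python (the exact formula varies by library, so treat this as an illustration rather than the formulation used in the post):

```python
import math
from collections import Counter

docs = [
    "topic maps merge subjects across data sets".split(),
    "tf idf weighs a term by how rare it is across documents".split(),
    "search engines rank documents by term weights".split(),
]

def tf_idf(docs):
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))   # document frequency
    vectors = []
    for d in docs:
        tf = Counter(d)
        # Terms that appear in every document get weight 0 (log(1) == 0).
        vectors.append({t: (tf[t] / len(d)) * math.log(n / df[t]) for t in tf})
    return vectors

for vec in tf_idf(docs):
    print(sorted(vec.items(), key=lambda kv: -kv[1])[:3])   # top 3 terms per doc
```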

Terrastore

Filed under: Javascript,MapReduce,NoSQL,Terrastore — Patrick Durusau @ 7:28 pm

Terrastore

From the webpage:

Terrastore, being based on a rock solid technology such as Terracotta, will focus more and more on advanced features and extensibility. Right now, Terrastore provides support for:

  • Custom data partitioning.
  • Event processing.
  • Push-down predicates.
  • Range queries.
  • Map/Reduce querying and processing.
  • Server-side update functions.

terrastore-0.8.2-dist.zip was just released.

This new version comes with several bug fixes and rock solid stability (at least, we hope so 😉), as well as a few important enhancements and new features such as:

  • Update to Terracotta 3.5.2 with performance improvements and reduced memory consumption.
  • Bulk operations.
  • Improved Javascript integration, with the possibility to dynamically load Javascript functions from files to use in server-side updates and map-reduce processing.

The Map/Reduce querying and processing is of obvious interest for topic map applications.
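As a reminder of what the map/reduce pattern buys you, here is a minimal sketch of the idea in Python over an in-memory dictionary standing in for a Terrastore bucket. It is not Terrastore's API (which takes server-side JavaScript functions); it only shows the shape of the computation, a map step emitting key/value pairs per document and a reduce step aggregating them:

```python
from collections import defaultdict

# A toy "bucket" of documents keyed by id, standing in for a Terrastore bucket.
bucket = {
    "doc1": {"tags": ["topic maps", "merging"]},
    "doc2": {"tags": ["graphs", "merging"]},
    "doc3": {"tags": ["topic maps", "graphs"]},
}

def map_phase(doc):
    # Emit a (tag, 1) pair for every tag in the document.
    for tag in doc.get("tags", []):
        yield tag, 1

def reduce_phase(pairs):
    # Sum the emitted values per key.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

result = reduce_phase(kv for doc in bucket.values() for kv in map_phase(doc))
print(result)   # {'topic maps': 2, 'merging': 2, 'graphs': 2}
```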

Neo4j Pitchfilm

Filed under: Marketing,Neo4j — Patrick Durusau @ 7:28 pm

Neo4j Pitchfilm

15 second film to pitch Neo4j.

Does this remind you of any technology?

Functional Data Structures – Chris Okasaki Publications

Filed under: Data Structures,Functional Programming — Patrick Durusau @ 7:28 pm

Functional Data Structures – Chris Okasaki Publications

I was trying to find a paper that Daniel Spiewak mentions in Extreme Cleverness: Functional Data Structures in Scala when I ran across this listing of publications by Chris Okasaki.

Facing the choice of burying the reference in what seems like an endless list of bookmarks or putting it in my blog where I may find it again and/or it may benefit someone else, I chose the latter course.

Enjoy.

Tinkerpop Stack Releases

Filed under: Blueprints,Frames,Gremlin,Rexster — Patrick Durusau @ 7:28 pm

Marko Rodriguez announced a new round of Tinkerpop Stack Releases today:

The TinkerPop stack went through another round of releases this morning.

  • Blueprints 1.0 (Blueprints): https://github.com/tinkerpop/blueprints/wiki/Release-Notes
  • Pipes 0.8 (Cleaner): https://github.com/tinkerpop/pipes/wiki/Release-Notes
  • Frames 0.5 (Beams): https://github.com/tinkerpop/frames/wiki/Release-Notes
  • Gremlin 1.3 (On the Case): https://github.com/tinkerpop/gremlin/wiki/Release-Notes
  • Rexster 0.6 (Dalmatian): https://github.com/tinkerpop/rexster/wiki/Release-Notes
    • Rexster-Kibbles 0.6: http://rexster-kibbles.tinkerpop.com

For those using Gremlin, Pipes, and Rexster, be sure to look through the release notes as APIs have changed slightly. Here are the main points of this release:

  • Blueprints now has transaction buffers and Neo4jBatchGraph for bulk loading a Neo4j graph.
  • Pipes makes use of FluentPipeline and PipeFunction which yields great expressivity and further opens up the framework to other JVM languages.
  • Gremlin is ~2.5x faster in many situations and has relegated most of its functionality to Pipes and native Java.
  • Rexster supports Neo4j High Availability and more updates to its REST API.

Extreme Cleverness: Functional Data Structures in Scala

Filed under: Functional Programming,Scala — Patrick Durusau @ 7:27 pm

Extreme Cleverness: Functional Data Structures in Scala by Daniel Spiewak.

Daniel is an enthusiastic and engaging speaker.

The graphics are particularly helpful.

The influence of chip architecture on the usefulness of data structures was interesting.

All the code, etc., at: http://www.github.com/djspiewak/extreme-cleverness
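If Scala isn't your daily language, the core trick in the talk, structural sharing in persistent data structures, can be sketched in a few lines of Python. This is only a toy cons list, not one of Okasaki's more interesting structures:

```python
from typing import NamedTuple, Optional

class Cons(NamedTuple):
    head: object
    tail: Optional["Cons"]

def prepend(value, lst):
    # O(1): the new list shares all of `lst` instead of copying it.
    return Cons(value, lst)

def to_py(lst):
    # Walk the immutable list into an ordinary Python list for display.
    out = []
    while lst is not None:
        out.append(lst.head)
        lst = lst.tail
    return out

xs = prepend(3, prepend(2, prepend(1, None)))
ys = prepend(0, xs)          # xs is untouched; ys shares its three cells
print(to_py(xs), to_py(ys))  # [3, 2, 1] [0, 3, 2, 1]
```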

September 17, 2011

The Revolution(s) Are Being Televised

Filed under: Crowd Sourcing,Image Recognition,Image Understanding,Marketing — Patrick Durusau @ 8:17 pm

Revolutions usually mean human rights violations, lots of them.

Patrick Meier has a project to collect evidence of mass human rights violations in Syria.

See: Help Crowdsource Satellite Imagery Analysis for Syria: Building a Library of Evidence

Topic maps are an ideal solution to link objects in dated satellite images to eye witness accounts, captured military documents, ground photos, news accounts and other information.

I say that for two reasons:

First, with a topic map you can start from any linked object in a photo, a witness account, a ground photo or a news account and see all related evidence for that location. Granted, that takes someone authoring that collation, but it doesn’t have to be only one someone.

Second, topic maps offer parallel subject processing, which can distribute the authoring task in a crowd-sourced project. For example, I could be doing photo analysis and marking the locations of military checkpoints. That would generate topics and associations for the geographic location, the type of installation, dates (from the photos), etc. Someone else could be interviewing witnesses and taking their testimony. As part of processing that testimony, another volunteer codes an approximate date and geographic location in connection with part of it. Still another person is coding military orders, by identified individuals, for checkpoints that include the one in question. Associations between all these separately encoded bits of evidence, each unknown to the individual volunteers, become a mouse-click away from coming to the attention of anyone reviewing the evidence. And determining responsibility.
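Here is a minimal sketch of that "mouse-click away" collation, assuming the volunteers agree on nothing except a shared set of subject identifier URIs (the URIs and field names below are hypothetical, for illustration only):

```python
topics = [
    {"names": ["Checkpoint, Highway 5 north"],
     "identifiers": {"http://example.org/syria/checkpoint/h5-north"}},
    {"names": ["witness account #117 location"],
     "identifiers": {"http://example.org/syria/checkpoint/h5-north"}},
    {"names": ["Order 2011-214 target"],
     "identifiers": {"http://example.org/syria/order/2011-214"}},
]

def merge(topics):
    """Merge topics that share at least one subject identifier (simplified)."""
    by_id = {}
    merged = []
    for t in topics:
        existing = next((by_id[i] for i in t["identifiers"] if i in by_id), None)
        if existing is None:
            existing = {"names": [], "identifiers": set()}
            merged.append(existing)
        existing["names"].extend(t["names"])
        existing["identifiers"] |= t["identifiers"]
        for i in existing["identifiers"]:
            by_id[i] = existing
    return merged

for t in merge(topics):
    print(t["names"])
# ['Checkpoint, Highway 5 north', 'witness account #117 location']
# ['Order 2011-214 target']
```

Topics authored independently end up merged the moment they share an identifier, which is exactly the property a crowd-sourced evidence project needs.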

The alternative, the one most commonly used, is to have an under-staffed international group piece together the best evidence it can from a sea of documents, photos, witness accounts, etc. An adequate job for the resources they have, but why settle for an “adequate” job when it can be done properly with 21st century technology?

Open Data Tools

Filed under: Data Mining,Visualization — Patrick Durusau @ 8:14 pm

Open Data Tools

Not much in the way of tools, yet, but is a site worth watching.

I remain uneasy about the emphasis on tools for “open data.” Anyone can use tools to manipulate “open data,” but if you don’t know the semantics of the data, the results are problematic.

We Feel Fine

Filed under: Data Mining,Semantics — Patrick Durusau @ 8:14 pm

We Feel Fine – An exploration of human emotions, in six movements.

From the “mission” page:

Since August 2005, We Feel Fine has been harvesting human feelings from a large number of weblogs. Every few minutes, the system searches the world’s newly posted blog entries for occurrences of the phrases “I feel” and “I am feeling”. When it finds such a phrase, it records the full sentence, up to the period, and identifies the “feeling” expressed in that sentence (e.g. sad, happy, depressed, etc.). Because blogs are structured in largely standard ways, the age, gender, and geographical location of the author can often be extracted and saved along with the sentence, as can the local weather conditions at the time the sentence was written. All of this information is saved.

The result is a database of several million human feelings, increasing by 15,000 – 20,000 new feelings per day. Using a series of playful interfaces, the feelings can be searched and sorted across a number of demographic slices, offering responses to specific questions like: do Europeans feel sad more often than Americans? Do women feel fat more often than men? Does rainy weather affect how we feel? What are the most representative feelings of female New Yorkers in their 20s? What do people feel right now in Baghdad? What were people feeling on Valentine’s Day? Which are the happiest cities in the world? The saddest? And so on.

The interface to this data is a self-organizing particle system, where each particle represents a single feeling posted by a single individual. The particles’ properties – color, size, shape, opacity – indicate the nature of the feeling inside, and any particle can be clicked to reveal the full sentence or photograph it contains. The particles careen wildly around the screen until asked to self-organize along any number of axes, expressing various pictures of human emotion. We Feel Fine paints these pictures in six formal movements titled: Madness, Murmurs, Montage, Mobs, Metrics, and Mounds.

At its core, We Feel Fine is an artwork authored by everyone. It will grow and change as we grow and change, reflecting what’s on our blogs, what’s in our hearts, what’s in our minds. We hope it makes the world seem a little smaller, and we hope it helps people see beauty in the everyday ups and downs of life.

I mention this as an interesting data set and possible approach to discovering the semantic range in the use of particular terms.

Clearly we use a common enough vocabulary for Google and similar applications to be useful to most people a large part of the time. But they fail with alarming regularity and without warning as well. And therein lies the rub. How do I know that the information in the first ten (10) hits is the most important information about my query? Or even relevant, without hand examining each hit? To say nothing of the “hits” at 100+ and beyond.

The “problem” terms are going to vary by domain, but I am curious whether identification of domains, along with the use of domain-based vocabularies, might improve searches, at least of professional literature. I suspect there are norms of usage in professional literature that may make it a “special” case. Perhaps most of the searches of interest to enterprise searchers are “special” cases in some sense of the word.
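A crude way to test the intuition is to re-rank hits by overlap with a domain vocabulary. The hits and vocabulary below are invented for illustration; a real experiment would use a curated domain thesaurus:

```python
# Hypothetical hits and a toy domain vocabulary, for illustration only.
hits = [
    {"title": "Jaguar XF price and specs", "text": "engine trim dealer"},
    {"title": "Jaguar habitat and diet", "text": "predator rainforest prey species"},
]
biology_vocab = {"habitat", "diet", "predator", "prey", "species", "rainforest"}

def rerank(hits, vocab):
    # Score each hit by how many of its terms fall inside the domain vocabulary.
    def score(hit):
        terms = (hit["title"] + " " + hit["text"]).lower().split()
        return sum(t in vocab for t in terms)
    return sorted(hits, key=score, reverse=True)

for h in rerank(hits, biology_vocab):
    print(h["title"])   # the biology sense of "jaguar" floats to the top
```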

GRASS: Geographic Resources Analysis Support System

GRASS: Geographic Resources Analysis Support System

The post about satellite imagery analysis for Syria made me curious about tools for automated analysis of satellite images.

From the webpage:

Commonly referred to as GRASS, this is free Geographic Information System (GIS) software used for geospatial data management and analysis, image processing, graphics/maps production, spatial modeling, and visualization. GRASS is currently used in academic and commercial settings around the world, as well as by many governmental agencies and environmental consulting companies. GRASS is an official project of the Open Source Geospatial Foundation.

You may also want to visit the Open Dragon project.

From the Open Dragon site:

Availability of good software for teaching Remote Sensing and GIS has always been a problem. Commercial software, no matter how good a discount is offered, remains expensive for a developing country, cannot be distributed to students, and may not be appropriate for education. Home-grown and university-sourced software lacks long-term support and the needed usability and robustness engineering.

The OpenDragon Project was established in the Department of Computer Engineering of KMUTT in December of 2004. The primary objective of this project is to develop, enhance, and maintain a high-quality, commercial-grade software package for remote sensing and GIS analysis that can be distributed free to educational organizations within Thailand. This package, OpenDragon, is based on the Version 5 of the commercial Dragon/ips® software developed and marketed by Goldin-Rudahl Systems, Inc.

As of 2010, Goldin-Rudahl Systems has agreed that the Open Dragon software, based on Dragon version 5, will be open source for non-commercial use. The software source code should be available on this server by early 2011.

And there is always the commercial side, if you have funding: ArcGIS. Esri, the maker of ArcGIS, supports several open source GIS projects.

The results of using these or other software packages can be tied to other information using topic maps.

Faster Approximate Pattern Matching in Compressed Repetitive Texts

Filed under: Pattern Matching — Patrick Durusau @ 8:13 pm

Faster Approximate Pattern Matching in Compressed Repetitive Texts by Travis Gagie, Pawel Gawrychowski, and Simon J. Puglisi.

Abstract:

Motivated by the imminent growth of massive, highly redundant genomic databases, we study the problem of compressing a string database while simultaneously supporting fast random access, substring extraction and pattern matching to the underlying string(s). Bille et al. (2011) recently showed how, given a straight-line program with $r$ rules for a string $s$ of length $n$, we can build an $O(r)$-word data structure that allows us to extract any substring $s[i..j]$ in $O(\log n + j - i)$ time. They also showed how, given a pattern $p$ of length $m$ and an edit distance $k \leq m$, their data structure supports finding all $\mathrm{occ}$ approximate matches to $p$ in $s$ in $O(r (\min(m k, k^4 + m) + \log n) + \mathrm{occ})$ time. Rytter (2003) and Charikar et al. (2005) showed that $r$ is always at least the number $z$ of phrases in the LZ77 parse of $s$, and gave algorithms for building straight-line programs with $O(z \log n)$ rules. In this paper we give a simple $O(z \log n)$-word data structure that takes the same time for substring extraction but only $O(z (\min(m k, k^4 + m)) + \mathrm{occ})$ time for approximate pattern matching.

It occurs to me that this could be useful for bioinformatic applications of topic maps that map between data sets and literature.

Interesting that the authors mention redundancy in web crawls. I suspect there are a number of domains with highly repetitive data, should we choose to look at them that way. Markup documents, for example.
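To make the problem statement concrete, here is the classic dynamic-programming baseline, not the authors' compressed data structure, that reports every position of an uncompressed text where the pattern matches within edit distance k:

```python
def approx_matches(p, s, k):
    """End positions in s where p matches within edit distance k.
    Classic O(m*n) dynamic programming over uncompressed text."""
    m = len(p)
    prev = [0] * (len(s) + 1)              # row for the empty pattern prefix
    for i in range(1, m + 1):
        curr = [i] + [0] * len(s)
        for j in range(1, len(s) + 1):
            cost = 0 if p[i - 1] == s[j - 1] else 1
            curr[j] = min(prev[j - 1] + cost,   # match or substitute
                          prev[j] + 1,          # edit: skip a pattern character
                          curr[j - 1] + 1)      # edit: skip a text character
        prev = curr
    return [j for j, d in enumerate(prev) if j >= 1 and d <= k]

print(approx_matches("ACGT", "TTACGTTACCTT", 1))
```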

Nvidia Research

Filed under: GPU — Patrick Durusau @ 8:12 pm

Nvidia Research

Nvidia has a number of programs for working with academic institutions and researchers. I got an email today extolling several new research centers mostly with projects in the hard sciences.

Please spread the call for research projects with GPUs to the difficult sciences: the social sciences and humanities in general.

For example, consider the analysis in How to kill a patent with Python. You discover two very different words for the same thing. You use topics to record that they represent the same subject, and the graphic display changes in real time to add or subtract patents of interest. And to add or subtract relationships to other patents, patent holders, parties of interest, non-patent literature, etc. Dynamic analysis, where your insights change and evolve as you explore the patents, with the ability to roll back to any point in your journey.

That is the power of very high-end processing and GPUs, such as those from Nvidia, are one way to get there.

BTW, there is an awesome collection of materials from academics already available at this location.

You could be the first person in your department/institution to publish on topic maps using Nvidia GPUs!

Got Hadoop?

Filed under: Bioinformatics,Biomedical,Hadoop — Patrick Durusau @ 8:12 pm

Got Hadoop?

This is going to require free registration at Genomeweb but I think it will be worth it. (Genomeweb also offers paid premium content but I haven’t tried any of it, yet.)

Nice overview of Hadoop in genome research.

Annoying in that it lists the following projects, sans hyperlinks. I have supplied the project listing with hyperlinks, just in case you are interested in Hadoop and genome research.

Crossbow: Whole genome resequencing analysis; SNP genotyping from short reads
Contrail: De novo assembly from short sequencing reads
Myrna: Ultrafast short read alignment and differential gene expression from large RNA-seq datasets
PeakRanger: Cloud-enabled peak caller for ChIP-seq data
Quake: Quality-aware detection and sequencing error correction tool
BlastReduce: High-performance short read mapping (superseded by CloudBurst)
CloudBLAST*: Hadoop implementation of NCBI’s BLAST
MrsRF: Algorithm for analyzing large evolutionary trees

*CloudBLAST was the only project without a webpage or similar source of information. This is a paper, perhaps the original paper on the technique. Searching for any of these techniques reveals a wealth of material on using Hadoop in bioinformatics.

Topic maps can capture your path through data (think of bread crumbs or string). So when today you think, “I should have gone left, rather than right”, you can retrace your steps and take another path. Try that with a Google search. If you are lucky, you may get the same ads. 😉

You can also share your bread crumbs or string with others, but that is a story for another time.

September 16, 2011

Strata 2011 Live Video Stream

Filed under: BigData,Conferences,Data — Patrick Durusau @ 6:43 pm

Strata 2011 Live Video Stream

From the webpage:

In case you don’t have the luck to be in New York around this time, but want to get a glimpse at what’s happening at the Strata Conference listen up: O’Reilly kindly provides live broadcasts from keynotes, talks and workshops. You can see the full schedule of broadcasts here: http://datavis.ch/oBT4EO.

Strata doesn’t ring a bell in your head? It’s one of the biggest conferences focused on data and the business around it organized by O’Reilly.

Strata Conference covers the latest and best tools and technologies for this new discipline, along the entire data supply chain—from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively. With hardcore technical sessions on parallel computing, machine learning, and interactive visualizations; case studies from finance, media, healthcare, and technology; and provocative reports from experts and innovators, Strata Conference showcases the people, tools, and technologies that make data work.

This is the other reason I buy O’Reilly publications.

What’s new in Apache Solr 3.4(?)

Filed under: Lucene,Solr — Patrick Durusau @ 6:42 pm

What’s new in Apache Solr 3.4: New Programmer’s Guide now available

From the post:

Yesterday’s announcement of the release of Solr 3.4 brings with it a host of welcome improvements that make search-related applications more powerful, faster, and easier to build. We’ve put together a new Programmer’s Guide to Open Source Search: What’s New in Apache Solr / Lucene 3.4 with details on what this new release holds for you, both in terms of what’s under the hood, new usability and user experience features, as well as new search capabilities:

This paper covers innovations including:

  • New search capabilities such as query support, function queries, analysis, input and output formats.
  • Performance improvements such as index segment management and distributed support for spellchecking.
  • New search application development options such as better range faceting and a new Velocity-driven search UI, plus spatial search and using Apache UIMA.
  • What to expect in Solr 4

Be sure to check out the annotators that link to services such as OpenCalais (page 20 of the whitepaper). They won’t be perfect but will certainly do well enough (with your assistance) to be useful.

Open Textbooks – Computer Science

Filed under: CS Lectures — Patrick Durusau @ 6:42 pm

Open Textbooks – Computer Science

From the “about” page:

The Community College Open Textbooks Collaborative is funded by The William and Flora Hewlett Foundation. This collection of sixteen educational non-profit and for-profit organizations, affiliated with more than 200 colleges, is focused on driving awareness and adoptions of open textbooks to more than 2000 community and other two-year colleges. This includes providing training for instructors adopting open resources, peer reviews of open textbooks, and mentoring online professional networks that support for authors opening their resources, and other services.

….
College Open Textbooks has peer-reviewed more than 100 open textbooks for use in community college courses and identified more than 550 others for consideration. Open textbooks are freely available for use without restriction and can be downloaded or printed from web sites and repositories.

There is a respectable listing of works under computer science.

This is a resource to use, recommend to others, and to support by contributing open textbooks.

Active Learning for Node Classification in Assortative and Disassortative Networks

Filed under: Clustering,Networks — Patrick Durusau @ 6:42 pm

Active Learning for Node Classification in Assortative and Disassortative Networks by Cristopher Moore, Xiaoran Yan, Yaojia Zhu, Jean-Baptiste Rouquier, and Terran Lane.

Abstract:

In many real-world networks, nodes have class labels, attributes, or variables that affect the network’s topology. If the topology of the network is known but the labels of the nodes are hidden, we would like to select a small subset of nodes such that, if we knew their labels, we could accurately predict the labels of all the other nodes. We develop an active learning algorithm for this problem which uses information-theoretic techniques to choose which nodes to explore. We test our algorithm on networks from three different domains: a social network, a network of English words that appear adjacently in a novel, and a marine food web. Our algorithm makes no initial assumptions about how the groups connect, and performs well even when faced with quite general types of network structure. In particular, we do not assume that nodes of the same class are more likely to be connected to each other—only that they connect to the rest of the network in similar ways.

If the abstract doesn’t recommend this paper as weekend reading, perhaps the following quote from the paper will:

our focus is on the discovery of functional communities in the network, and our underlying generative model is designed around the assumption that these communities exist.

You will recall from Don’t Trust Your Instincts that we are likely to see what we expect to see in text, or in this case, networks. Not that using this approach frees us from introducing bias, but it does ensure that the observer bias is uniformly applied across the data set. Which may lead to results that startle us, interest us, or that we consider spurious. In any event, this is one more approach to test and possibly illuminate our understanding of a network.
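The flavor of the approach, stripped of the information-theoretic machinery, is "query the node whose label you are least sure about." Here is a toy sketch on a six-node network with invented labels; the real algorithm uses a generative block model rather than neighbor votes:

```python
from collections import Counter

# Toy network: two assortative groups joined by a single edge (invented data).
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
true_labels = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}

adj = {n: set() for n in true_labels}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

known = {0: "A"}                                  # one label to start with

def uncertainty(node):
    """1.0 = no information, 0.0 = labelled neighbours all agree."""
    votes = Counter(known[m] for m in adj[node] if m in known)
    if not votes:
        return 1.0
    return 1.0 - votes.most_common(1)[0][1] / sum(votes.values())

for _ in range(2):                                # two rounds of active learning
    query = max((n for n in adj if n not in known), key=uncertainty)
    known[query] = true_labels[query]             # ask the "oracle" for a label
    print("queried node", query, "->", known[query])
```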

PS: Are communities the equivalent of clusters?

Information Bridge

Filed under: Information Retrieval,Library — Patrick Durusau @ 6:41 pm

Information Bridge

From the webpage:

The Information Bridge: DOE Scientific and Technical Information provides free public access to over 282,000 full-text documents and bibliographic citations of Department of Energy (DOE) research report literature. Documents are primarily from 1991 forward and were produced by DOE, the DOE contractor community, and/or DOE grantees. Legacy documents are added as they become available in electronic format.

The Information Bridge contains documents and citations in physics, chemistry, materials, biology, environmental sciences, energy technologies, engineering, computer and information science, renewable energy, and other topics of interest related to DOE’s mission.

Another important source of US government funded research on information retrieval.

Learning Topic Models by Belief Propagation

Filed under: Bayesian Models,Latent Dirichlet Allocation (LDA) — Patrick Durusau @ 6:41 pm

Learning Topic Models by Belief Propagation by Jia Zeng, William K. Cheung, and Jiming Liu.

Abstract:

Latent Dirichlet allocation (LDA) is an important class of hierarchical Bayesian models for probabilistic topic modeling, which attracts worldwide interests and touches many important applications in text mining, computer vision and computational biology. This paper proposes a novel tree-structured factor graph representation for LDA within the Markov random field (MRF) framework, which enables the classic belief propagation (BP) algorithm for exact inference and parameter estimation. Although two commonly-used approximation inference methods, such as variational Bayes (VB) and collapsed Gibbs sampling (GS), have gained great successes in learning LDA, the proposed BP is competitive in both speed and accuracy validated by encouraging experimental results on four large-scale document data sets. Furthermore, the BP algorithm has the potential to become a generic learning scheme for variants of LDA-based topic models. To this end, we show how to learn two typical variants of LDA-based topic models, such as author-topic models (ATM) and relational topic models (RTM), using belief propagation based on the factor graph representation.

I have just started reading this paper but wanted to bring it to your attention. I peeked at the results and it looks quite promising.

This work was tested against the following data sets:

1) CORA [30] contains abstracts from the CORA research paper search engine in machine learning area, where the documents can be classified into 7 major categories.

2) MEDL [31] contains abstracts from the MEDLINE biomedical paper search engine, where the documents fall broadly into 4 categories.

3) NIPS [32] includes papers from the conference “Neural Information Processing Systems”, where all papers are grouped into 13 categories. NIPS has no citation link information.

4) BLOG [33] contains a collection of political blogs on the subject of American politics in the year 2008, where all blogs can be broadly classified into 6 categories. BLOG has no author information.

with positive results.
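For readers who have not implemented a topic model before, here is a minimal collapsed Gibbs sampler for LDA, i.e. the "GS" baseline the abstract says the belief propagation algorithm is compared against. It is unoptimized and meant only to show the counts involved:

```python
import numpy as np

def lda_gibbs(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (a minimal, unoptimized sketch)."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))          # doc-topic counts
    n_kw = np.zeros((K, V))                  # topic-word counts
    n_k = np.zeros(K)                        # topic totals
    z = []                                   # topic assignment per token
    for d, doc in enumerate(docs):
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                  # remove the current assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # Conditional distribution over topics for this token.
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + beta * V)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k                  # add the new assignment back
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return n_kw

# Tiny corpus: word ids 0-3 form one vocabulary cluster, 4-7 another.
docs = [[0, 1, 2, 3, 0, 1], [4, 5, 6, 7, 4, 5], [0, 2, 1, 3], [5, 7, 6, 4]]
print(lda_gibbs(docs, V=8, K=2))   # each row should usually favor one cluster
```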

Topic Modeling Bibliography

Filed under: Latent Dirichlet Allocation (LDA),Topic Models (LDA) — Patrick Durusau @ 6:40 pm

Topic Modeling Bibliography

An extensive bibliography on topic modeling (LDA) by David Mimno.

There are a number of related resources on his homepage.

Scientific and Technical Information (STI)

Filed under: CS Lectures,Library — Patrick Durusau @ 6:40 pm

Scientific and Technical Information (STI)

From the “about” page:

STI (scientific and technical information) is the collected set of facts, analyses, and conclusions resulting from scientific, technical, and related engineering research and development efforts, both basic and applied.

That has to be a classic as far as non-helpful explanations go. 😉

Or you can try:

This site helps you locate, obtain, and publish NASA aerospace information and find national and international information pertinent to your research and mission.

A little better.

Access publicly available NASA and NACA reports, conference papers, journal articles, and more. Includes over a quarter-million full-text documents, and links to more than a half-million images and video clips.

Better still.

And then:

NTRS promotes the dissemination of NASA STI to the widest audience possible by allowing NTRS information to be harvested by sites using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). OAI-PMH defines a mechanism for information technology systems to exchange citation information using the open standards HTTP (Hypertext Transport Protocol) and XML (Extensible Markup Language). NTRS is designed to accept and respond to automated requests using OAI-PMH. Automated requests only harvest citation information and not the full-text document images.

Which means you can populate your topic map with data from this source quite easily.
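Here is a minimal sketch of an OAI-PMH harvest in Python. The verb and metadataPrefix parameters are part of the OAI-PMH standard, but the base URL below is a placeholder; check NTRS's documentation for the actual endpoint:

```python
import urllib.request
import xml.etree.ElementTree as ET

# Base URL is illustrative only -- substitute the real NTRS OAI-PMH endpoint.
BASE = "https://ntrs.nasa.gov/oai"
url = BASE + "?verb=ListRecords&metadataPrefix=oai_dc"

with urllib.request.urlopen(url) as resp:
    tree = ET.parse(resp)

ns = {"oai": "http://www.openarchives.org/OAI/2.0/",
      "dc": "http://purl.org/dc/elements/1.1/"}

# Print the Dublin Core title of each harvested citation record.
for record in tree.findall(".//oai:record", ns):
    title = record.find(".//dc:title", ns)
    if title is not None:
        print(title.text)
```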

EECS Technical Reports UC Berkeley

Filed under: CS Lectures,Library — Patrick Durusau @ 6:39 pm

EECS Technical Reports UC Berkeley

From the webpage:

The EECS Technical Memorandum Series provides a dated archive of EECS research. It includes Ph.D. theses and master’s reports as well as technical documents that complement traditional publication media such as journals. For example, technical reports may document work in progress, early versions of results that are eventually published in more traditional media, and supplemental information such as long proofs, software documentation, code listings, or elaborated examples.

Technical reports listed here include the EECS Technical Report series (started in October 2005), the CS Technical Report series (from 1982 to 2005), and the ERL Technical report series (from 1984 to 2005, plus selected titles from before 1984). Full text is included for the EECS and CS series, but not for the ERL series. In the case of the ERL series, full text may be available on other web sites (such as the personal web pages of the authors).

Development at the Speed and Scale of Google

Filed under: Computer Science,Dependency,Software — Patrick Durusau @ 6:38 pm

Development at the Speed and Scale of Google by Ashish Kumar.

Interesting overview of development at Google. I included it as background for the question:

How would you use topic maps as part of documenting a development process?

Or perhaps better: Are you using topic maps as part of a development process and if so, how?

Now that I think about it, there may be another way to approach the use of topic maps in software engineering. Harvest the bug reports and push them through text processing tools. I haven’t ever thought of bug reports as a genre but I suspect they have all the earmarks of one.

Thoughts? Comments?

Building Data Science Teams

Filed under: Data,Data Analysis — Patrick Durusau @ 6:38 pm

Building Data Science Teams: The Skills, Tools, and Perspectives Behind Great Data Science Groups by DJ Patil.

From page 1:

Given how important data science has grown, it’s important to think about what data scientists add to an organization, how they fit in, and how to hire and build effective data science teams.

Nothing you probably haven’t heard before but a reminder isn’t a bad thing.

The tools to manipulate data are becoming commonplace. What remains, and will remain, elusive are the skills to use those tools well.
