Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

September 3, 2012

Legal Rules, Text and Ontologies Over Time [The eternal “now?”]

Filed under: Legal Informatics,Ontology,Semantics — Patrick Durusau @ 3:06 pm

Legal Rules, Text and Ontologies Over Time by Monica Palmirani, Tommaso Ognibene and Luca Cervone.

Abstract:

The current paper presents the “Fill the gap” project that aims to design a set of XML standards for modelling legal documents in the Semantic Web over time. The goal of the project is to design an information system using XML standards able to store in an XML-native database legal resources and legal rules in an integrated way for supporting legal knowledge engineers and end-users (e.g., public administrative officers, judges, citizens).

It was refreshing to read:

The law changes over time and consequently change the rules and the ontological classes (e.g., the definition of EU citizenship changed in 2004 with the annexation of 10 new member states in the European Community). It is also fundamental to assign dates to the ontology and to the rules, based on an analytical approach, to the text, and analyze the relationships among sets of dates. The semantic web cake recommends that content, metadata should be modelled and represented in separate and clean layers. This recommendation is not widely followed from too many XML schemas, including those in the legal domain. The layers of content and rules are often confused to pursue a short annotation syntax, or procedural performance parameters or simply because a neat analysis of the semantic and abstract components is missing.

Not being mindful of time (the effective dates of changes to laws, the dates of events and transactions) can be hazardous to your pocketbook and/or your freedom!

Does your topic map account for time or does it exist in an eternal “now?” like the WWW?

I first saw this at Legal Informatics.

OASIS LegalRuleML

Filed under: Legal Informatics,LegalRuleML,RuleML — Patrick Durusau @ 2:44 pm

OASIS LegalRuleML

From the webpage:

The OASIS LegalRuleML TC defines a rule interchange language for the legal domain. The work enables modeling and reasoning that allows implementers to structure, evaluate, and compare legal arguments constructed using the rule representation tools provided.

Legal Informatics posted a notice of a new tutorial introduction to LegalRuleML.

If you are planning IT or semantic integration projects in legal circles, it is worth your while to take a look at LegalRuleML.

Google at UAI 2012

Filed under: Artificial Intelligence — Patrick Durusau @ 2:23 pm

Google at UAI 2012 by Kevin Murphy.

From the post:

The conference on Uncertainty in Artificial Intelligence (UAI) is one of the premier venues for research related to probabilistic models and reasoning under uncertainty. This year’s conference (the 28th) set several new records: the largest number of submissions (304 papers, last year 285), the largest number of participants (216, last year 191), the largest number of tutorials (4, last year 3), and the largest number of workshops (4, last year 1). We interpret this as a sign that the conference is growing, perhaps as part of the larger trend of increasing interest in machine learning and data analysis.

There were many interesting presentations. A couple of my favorites included:

  • “Video In Sentences Out,” by Andrei Barbu et al. This demonstrated an impressive system that is able to create grammatically correct sentences describing the objects and actions occurring in a variety of different videos.
  • “Exploiting Compositionality to Explore a Large Space of Model Structures,” by Roger Grosse et al. This paper (which won the Best Student Paper Award) proposed a way to view many different latent variable models for matrix decomposition – including PCA, ICA, NMF, Co-Clustering, etc. – as special cases of a general grammar. The paper then showed ways to automatically select the right kind of model for a dataset by performing greedy search over grammar productions, combined with Bayesian inference for model fitting.

You can find other individual papers at: Schedule UAI 2012.

Or you can grab the entire proceedings. (972 page PDF file)

Either way, you will find numerous items for exploration and conversation.

Hakim [Graphics, Visualizations]

Filed under: CSS3,Graphics,Javascript,Visualization — Patrick Durusau @ 1:58 pm

Hakim by Hakim El Hattab.

I was following a link to Reveal.js HTML Presentations Made Easy when I discovered its “parent” site.

A likely source of ideas for visualizing your data sets.

I first saw this at DZone.

DBMS_COMPARISON Package [Oracle]

Filed under: Oracle — Patrick Durusau @ 1:31 pm

DBMS_COMPARISON Package

Mahmoud A. El-Sayed introduces the Oracle 11g package, DBMS_COMPARISON, which compares database objects, thus:

The DBMS_COMPARISON package can compare the following types of database objects:
    a- Tables
    b- Single-table views
    c- Materialized views
    d- Synonyms for tables, single-table views, and materialized views

The DBMS_COMPARISON package cannot compare data in columns of the following data types:
    a- LONG
    b- LONG RAW
    c- ROWID
    d- UROWID
    e- CLOB
    f- NCLOB
    g- BLOB
    h- BFILE
    i- User-defined types (including object types, REFs, varrays, and nested tables)
    j- Oracle-supplied types (including any types, XML types, spatial types, and media types)

You may also be interested in the Oracle documentation on DBMS_COMPARISON.

Merging presumes some comparison step so I commend this to you if you are in an Oracle environment.
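
If you want to poke at DBMS_COMPARISON from outside SQL*Plus, here is a minimal sketch driving it from Python with cx_Oracle. The credentials, schema, table and database link names are all assumptions; adjust them for your environment (and expect the usual requirements, such as a usable index on the compared table and the necessary privileges).

    # Minimal sketch, assuming a local HR.EMPLOYEES table, a database link
    # named REMOTE_DB to the other copy, and suitable privileges.
    import cx_Oracle

    PLSQL = """
    DECLARE
      scan_info  DBMS_COMPARISON.COMPARISON_TYPE;
      consistent BOOLEAN;
    BEGIN
      DBMS_COMPARISON.CREATE_COMPARISON(
        comparison_name => 'CMP_EMPLOYEES',
        schema_name     => 'HR',
        object_name     => 'EMPLOYEES',
        dblink_name     => 'REMOTE_DB');

      consistent := DBMS_COMPARISON.COMPARE(
        comparison_name => 'CMP_EMPLOYEES',
        scan_info       => scan_info,
        perform_row_dif => TRUE);

      :scan_id    := scan_info.scan_id;
      :consistent := CASE WHEN consistent THEN 1 ELSE 0 END;
    END;"""

    conn = cx_Oracle.connect("hr/secret@localhost/orcl")  # assumed connect string
    cur = conn.cursor()
    scan_id, consistent = cur.var(int), cur.var(int)
    cur.execute(PLSQL, scan_id=scan_id, consistent=consistent)
    print("scan", scan_id.getvalue(), "consistent?", bool(consistent.getvalue()))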

Thoughts on the data type exclusions from comparison? The documentation says they are excluded but I didn’t see any hint of a reason for the exclusion.

I first saw this at DZone.

Small Data (200 MB up to 10 GB) [MySQL, MapReduce and Hive by the Numbers]

Filed under: Hive,MapReduce,MySQL — Patrick Durusau @ 1:09 pm

Study Stacks MySQL, MapReduce and Hive

From the post:

Many small and medium sized businesses would like to get in on the big data game but do not have the resources to implement parallel database management systems. That being the case, which relational database management system would provide small businesses the highest performance?

This question was asked and answered by Marissa Hollingsworth of Boise State University in a graduate case study that compared the performance rates of MySQL, Hadoop MapReduce, and Hive at scales no larger than nine gigabytes.

Hollingsworth also used only relational data, such as payment information, which stands to reason since anything more would require a parallel system. “This experiment,” said Hollingsworth “involved a payment history analysis which considers customer, account, and transaction data for predictive analytics.”

The case study, the full text of which can be found here, concluded that MapReduce would beat out MySQL and Hive for datasets larger than one gigabyte. As Hollingsworth wrote, “The results show that the single server MySQL solution performs best for trial sizes ranging from 200MB to 1GB, but does not scale well beyond that. MapReduce outperforms MySQL on data sets larger than 1GB and Hive outperforms MySQL on sets larger than 2GB.”

Although your friends may not admit it, some of them have small data. Or interact with clients with small data.

You print this post out and put it in their inbox. Anonymously. They will appreciate it even if they can’t acknowledge having seen it.

When thinking about data and data storage, you might want to keep in mind the comparisons you will find at: How much is 1 byte, kilobyte, megabyte, gigabyte, etc.?

Roughly speaking, 1 GB is the equivalent of 4,473 books.

The 10 GB limit in this study is roughly 44,730 books.

Sometimes all you need is small data.

September 2, 2012

Discovering message flows in actor systems with the Spider Pattern

Filed under: Actor-Based,Akka,Messaging — Patrick Durusau @ 7:29 pm

Discovering message flows in actor systems with the Spider Pattern by Raymond Roestenburg.

From the post:

In this post I’m going to show a pattern that can be used to discover facts about an actor system while it is running. It can be used to understand how messages flow through the actors in the system. The main reason why I built this pattern is to understand what is going on in a running actor system that is distributed across many machines. If I can’t picture it, I can’t understand it (and I’m in good company with that quote 🙂

Building actor systems is fun but debugging them can be difficult, you mostly end up browsing through many log files on several machines to find out what’s going on. I’m sure you have browsed through logs and thought, “Hey, where did that message go?”, “Why did this message cause that effect” or “Why did this actor never get a message?”

This is where the Spider pattern comes in.

I would think the better quote would be: “If I can’t see it, I can’t understand it.” But each to their own.

Message passing systems remind me of Newcomb’s requirement for having audit trails for merging behavior.

Not necessary for every use case but when it is necessary, it is nice to know robust auditing is possible.

Or perhaps the better way to put it is that auditing is adjustable.

We can go from tracking every operation at one extreme, to a middle ground of some tracking that still protects political appointees or career civil servants, or even to a wide open system (sort of like Twitter or Facebook).
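
To make the audit trail idea concrete, here is a toy, framework-free Python sketch of the general notion behind the pattern: every hop appends itself to a trail that travels with the message. It is only an illustration of message-flow tracing, not Roestenburg’s Akka implementation.

    # Toy message-flow tracing: each "actor" records itself on the trail
    # carried by the message before forwarding it. Names and topology are
    # invented for the example.
    from dataclasses import dataclass, field

    @dataclass
    class Message:
        payload: str
        trail: list = field(default_factory=list)

    class Actor:
        def __init__(self, name, next_actor=None):
            self.name = name
            self.next_actor = next_actor

        def receive(self, msg: Message):
            msg.trail.append(self.name)          # record the hop
            if self.next_actor is not None:
                self.next_actor.receive(msg)
            else:
                print("payload:", msg.payload)
                print("path   :", " -> ".join(msg.trail))

    # wire up a small pipeline and send one message through it
    pipeline = Actor("ingest", Actor("enrich", Actor("store")))
    pipeline.receive(Message("hello"))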

Entity disambiguation using semantic networks

Filed under: Entity Resolution,Graphs,Networks,Semantic Graph — Patrick Durusau @ 7:20 pm

Entity disambiguation using semantic networks by Jorge H. Román, Kevin J. Hulin, Linn M. Collins and James E. Powell. Journal of the American Society for Information Science and Technology, published 29 August 2012.

Abstract:

A major stumbling block preventing machines from understanding text is the problem of entity disambiguation. While humans find it easy to determine that a person named in one story is the same person referenced in a second story, machines rely heavily on crude heuristics such as string matching and stemming to make guesses as to whether nouns are coreferent. A key advantage that humans have over machines is the ability to mentally make connections between ideas and, based on these connections, reason how likely two entities are to be the same. Mirroring this natural thought process, we have created a prototype framework for disambiguating entities that is based on connectedness. In this article, we demonstrate it in the practical application of disambiguating authors across a large set of bibliographic records. By representing knowledge from the records as edges in a graph between a subject and an object, we believe that the problem of disambiguating entities reduces to the problem of discovering the most strongly connected nodes in a graph. The knowledge from the records comes in many different forms, such as names of people, date of publication, and themes extracted from the text of the abstract. These different types of knowledge are fused to create the graph required for disambiguation. Furthermore, the resulting graph and framework can be used for more complex operations.

To give you a sense of the authors’ approach:

A semantic network is the underlying information representation chosen for the approach. The framework uses several algorithms to generate subgraphs in various dimensions. For example: a person’s name is mapped into a phonetic dimension, the abstract is mapped into a conceptual dimension, and the rest are mapped into other dimensions. To map a name into its phonetic representation, an algorithm translates the name of a person into a sequence of phonemes. Therefore, two names that are written differently but pronounced the same are considered to be the same in this dimension. The “same” qualification in one of these dimensions is then used to identify potential coreferent entities. Similarly, an algorithm for generating potential alternate spellings of a name has been used to find entities for comparison with similarly spelled names by computing word distance.

The hypothesis underlying our approach is that coreferent entities are strongly connected on a well-constructed graph.
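
As a crude sketch of the “phonetic dimension plus connectedness” idea: map name variants to a phonetic-ish key, add edges for each kind of evidence, and rank candidate pairs by how strongly they are connected. The key function and the toy records below are my own stand-ins, not the authors’ framework.

    # Toy version: records are linked when their author names share a crude
    # phonetic key or when they share a co-author; the link weight stands in
    # for "strength of connection" in the graph.
    from collections import defaultdict
    from itertools import combinations

    def crude_key(name):
        """Drop vowels, punctuation and doubled letters so similarly
        pronounced spellings collide. A stand-in for a real phoneme mapping."""
        letters = [c for c in name.lower() if c.isalpha() and c not in "aeiou"]
        collapsed = []
        for c in letters:
            if not collapsed or collapsed[-1] != c:
                collapsed.append(c)
        return "".join(collapsed)

    records = [
        {"author": "Jorge Roman",   "coauthors": {"K. Hulin", "J. Powell"}},
        {"author": "Jorge Romann",  "coauthors": {"J. Powell"}},
        {"author": "George Romero", "coauthors": {"A. Savini"}},
    ]

    edges = defaultdict(int)
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        if crude_key(a["author"]) == crude_key(b["author"]):
            edges[(i, j)] += 1                                  # phonetic dimension
        edges[(i, j)] += len(a["coauthors"] & b["coauthors"])   # shared co-authors

    for (i, j), weight in sorted(edges.items(), key=lambda kv: -kv[1]):
        if weight:
            print(records[i]["author"], "<->", records[j]["author"], "strength", weight)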

Question: What if the nodes to which the coreferent entities are strongly connected are themselves ambiguous?

Understanding Indexing: …[Tokutek]

Filed under: Indexing,Tokutek — Patrick Durusau @ 6:34 pm

Understanding Indexing: Three rules on making indexes around queries to provide good performance (video) – slides

Tim Callaghan mentioned this webinar as an upcoming event; here is the description from the on-demand version:

Application performance often depends on how fast a query can respond and query performance almost always depends on good indexing. So one of the quickest and least expensive ways to increase application performance is to optimize the indexes. This talk presents three simple and effective rules on how to construct indexes around queries that result in good performance.

This is a general discussion applicable to all databases using indexes and is not specific to any particular MySQL® storage engine (e.g., InnoDB, TokuDB®, etc.). The rules are explained using a simple model that does NOT rely on understanding B-trees, Fractal Tree® indexing, or any other data structure used to store the data on disk.

Zardosht Kasheff presenting.

268x Query Performance Bump for MongoDB

Filed under: Fractal Trees,MongoDB,Tokutek — Patrick Durusau @ 6:24 pm

268x Query Performance Increase for MongoDB with Fractal Tree Indexes, SAY WHAT? by Tim Callaghan.

From the post:

Last week I wrote about our 10x insertion performance increase with MongoDB. We’ve continued our experimental integration of Fractal Tree® Indexes into MongoDB, adding support for clustered indexes. A clustered index stores all non-index fields as the “value” portion of the index, as opposed to a standard MongoDB index that stores a pointer to the document data. The benefit is that indexed lookups can immediately return any requested values instead of needing to do an additional lookup (and potential disk IOs) for the requested fields.

I’m trying to recover from learning about scalable subgraph matching, Efficient Subgraph Matching on Billion Node Graphs [Parallel Graph Processing], and now the nice folks at Tokutek post a 26,816% query performance increase for MongoDB.

They claim not to be MongoDB experts. I guess that’s right. Otherwise the increase in performance would have been higher. 😉

Serious question: How long will it take this sort of performance increase to impact the modeling and design of information systems?

And in what way?

With high enough performance, can subject identity be modeled interactively?

Learning Mahout : Clustering

Filed under: Clustering,Machine Learning,Mahout — Patrick Durusau @ 6:01 pm

Learning Mahout : Clustering by Sujit Pal.

From the post:

The next section in the MIA book is Clustering. As with Recommenders, Mahout provides both in-memory and map-reduce versions of various clustering algorithms. However, unlike Recommenders, there are quite a few toolkits (like Weka or Mallet for example) which are more comprehensive than Mahout for small or medium sized datasets, so I decided to concentrate on the M/R implementations.

The full list of clustering algorithms available in Mahout at the moment can be found on its Wiki Page under the Clustering section. The ones covered in the book are K-Means, Canopy, Fuzzy K-Means, LDA and Dirichlet. All these algorithms expect data in the form of vectors, so the first step is to convert the input data into this format, a process known as vectorization. Essentially, clustering is the process of finding nearby points in n-dimensional space, where each vector represents a point in this space, and each element of a vector represents a dimension in this space.

It is important to choose the right vector format for the clustering algorithm. For example, one should use the SequentialAccessSparseVector for KMeans, since there is a lot of sequential access in the algorithm. Other possibilities are the DenseVector and the RandomAccessSparseVector formats. The input to a clustering algorithm is a SequenceFile containing key-value pairs of {IntWritable, VectorWritable} objects. Since the implementations are given, Mahout users would spend most of their time vectorizing the input (and thinking about what feature vectors to use, of course).

Once vectorized, one can invoke the appropriate algorithm either by calling the appropriate bin/mahout subcommand from the command line, or through a program by calling the appropriate Driver’s run method. All the algorithms require the initial centroids to be provided, and the algorithm iteratively perturbs the centroids until they converge. One can either guess randomly or use the Canopy clusterer to generate the initial centroids.

Finally, the output of the clustering algorithm can be read using the Mahout cluster dumper subcommand. To check the quality, take a look at the top terms in each cluster to see how “believable” they are. Another way to measure the quality of clusters is to measure the intercluster and intracluster distances. A lower spread of intercluster and intracluster distances generally imply “good” clusters. Here is code to calculate inter-cluster distance based on code from the MIA book.
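
The code referenced in that last paragraph is Java and is not reproduced here. As a rough sketch of the measure itself, the average pairwise Euclidean distance between cluster centroids can be computed along these lines (the centroids below are invented; in practice they would come from Mahout’s cluster output):

    # Rough sketch: average pairwise Euclidean distance between centroids,
    # i.e. the inter-cluster distance measure mentioned above.
    from itertools import combinations
    from math import dist

    centroids = [
        (0.1, 0.9, 0.3),
        (0.8, 0.2, 0.5),
        (0.4, 0.4, 0.9),
    ]

    pairs = list(combinations(centroids, 2))
    inter_cluster = sum(dist(a, b) for a, b in pairs) / len(pairs)
    print("average inter-cluster distance:", round(inter_cluster, 4))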

A detailed walk-through of two of the four case studies in Mahout In Action. This post and the book are well worth your time.

Efficient Subgraph Matching on Billion Node Graphs [Parallel Graph Processing]

Filed under: Graphs,Neo4j,Networks,Trinity — Patrick Durusau @ 4:31 pm

Efficient Subgraph Matching on Billion Node Graphs by Zhao Sun (Fudan University, China), Hongzhi Wang (Harbin Institute of Technology, China), Haixun Wang (Microsoft Research Asia, China), Bin Shao (Microsoft Research Asia, China) and Jianzhong Li (Harbin Institute of Technology, China).

Abstract:

The ability to handle large scale graph data is crucial to an increasing number of applications. Much work has been dedicated to supporting basic graph operations such as subgraph matching, reachability, regular expression matching, etc. In many cases, graph indices are employed to speed up query processing. Typically, most indices require either super-linear indexing time or super-linear indexing space. Unfortunately, for very large graphs, super-linear approaches are almost always infeasible. In this paper, we study the problem of subgraph matching on billion-node graphs. We present a novel algorithm that supports efficient subgraph matching for graphs deployed on a distributed memory store. Instead of relying on super-linear indices, we use efficient graph exploration and massive parallel computing for query processing. Our experimental results demonstrate the feasibility of performing subgraph matching on web-scale graph data.

Did you say you were interested in parallel graph processing?

This paper and the materials cited in the bibliography make a nice introduction to the current options for graph processing.
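
To make the problem itself concrete, here is a toy backtracking matcher over adjacency lists. It is nothing like the paper’s distributed, exploration-based algorithm; it only shows what “subgraph matching” asks for, and why brute force stops scaling long before a billion nodes.

    # Toy subgraph matcher: enumerate mappings from pattern vertices to data
    # vertices such that every pattern edge is also an edge in the data graph.
    def match(pattern, data):
        p_nodes = list(pattern)

        def extend(assignment):
            if len(assignment) == len(p_nodes):
                yield dict(assignment)
                return
            p = p_nodes[len(assignment)]
            for d in data:
                if d in assignment.values():
                    continue
                # every already-mapped pattern neighbor must be a data neighbor
                if all(assignment[q] in data[d]
                       for q in pattern[p] if q in assignment):
                    assignment[p] = d
                    yield from extend(assignment)
                    del assignment[p]

        yield from extend({})

    # triangle pattern against a small undirected data graph (adjacency sets)
    pattern = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
    data = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}
    for m in match(pattern, data):
        print(m)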

I first saw this at Alex Popescu’s myNoSQL, citing it from the VLDB proceedings.

With the DBLP enhanced version of the VLDB proceedings, VLDB 2012 Ice Breaker v0.1, DBLP links for the authors were easy.

HTML [Lessons in Semantic Interoperability – Part 3]

Filed under: HTML,Interoperability,Semantics — Patrick Durusau @ 12:06 pm

If HTML is an example of semantic interoperability, are there parts of HTML that can be re-used for more semantic interoperability?

Some three-year-old numbers on usage of HTML elements:

Element Percentage
a 21.00
td 15.63
br 9.08
div 8.23
tr 8.07
img 7.12
option 4.90
li 4.48
span 3.98
table 3.15
font 2.80
b 2.32
p 1.98
input 1.79
script 1.77
strong 0.97
meta 0.95
link 0.66
ul 0.65
hr 0.37
http://webmasters.stackexchange.com/questions/11406/recent-statistics-on-html-usage-in-the-wild
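
(If you want fresher numbers for pages you care about, here is a quick sketch that counts start tags on a single page using only the Python standard library. The URL is just a placeholder.)

    # Quick sketch: tally start-tag frequency for one page.
    from collections import Counter
    from html.parser import HTMLParser
    from urllib.request import urlopen

    class TagCounter(HTMLParser):
        def __init__(self):
            super().__init__()
            self.counts = Counter()

        def handle_starttag(self, tag, attrs):
            self.counts[tag] += 1

    html = urlopen("https://example.com/").read().decode("utf-8", "replace")
    counter = TagCounter()
    counter.feed(html)

    total = sum(counter.counts.values())
    for tag, n in counter.counts.most_common(10):
        print(f"{tag:8s} {100 * n / total:5.2f}%")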

Assuming they still hold true, the <a> element is by far the most popular.

Implications for a semantic interoperability solution that leverages the <a> element?

Leave the syntax the hell alone!

As we saw in parts 1 and 2 of this series, the <a> element has:

  • simplicity
  • immediate feedback

If you don’t believe me, teach someone who doesn’t know HTML at all how to create an <a> element and verify its presence in a browser. (I’ll wait.)

Back so soon? 😉

To summarize: The <a> element is simple, has immediate feedback and is in widespread use.

All of which makes it a likely candidate to leverage for semantic interoperability. But how?

And what of all the other identifiers in the world? What happens to them?

September 1, 2012

An Indexing Structure for Dynamic Multidimensional Data in Vector Space

Filed under: Indexing,Multidimensional,Vector Space Model (VSM) — Patrick Durusau @ 3:48 pm

An Indexing Structure for Dynamic Multidimensional Data in Vector Space by Elena Mikhaylova, Boris Novikov and Anton Volokhov. (Advances in Databases and Information Systems, Advances in Intelligent Systems and Computing, 2013, Volume 186, 185-193, DOI: 10.1007/978-3-642-32741-4_17)

Abstract:

The multidimensional k-NN (k nearest neighbors) query problem is relevant to a large variety of database applications, including information retrieval, natural language processing, and data mining. To solve it efficiently, the database needs an indexing structure that provides this kind of search. However, attempts to find an exact solution are hardly feasible in multidimensional space. In this paper, a novel indexing technique for the approximate solution of the k-NN problem is described and analyzed. The construction of the indexing tree is based on clustering. Indexing structure is implemented on top of high-performance industrial DBMS.
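
As a rough illustration of the cluster-then-search idea sketched in the abstract (and only that; the paper builds a proper index tree on top of an industrial DBMS), here is a minimal approximate k-NN sketch:

    # Minimal sketch of approximate k-NN via clustering: bucket points by
    # nearest centroid, then search only the query's bucket. The data and
    # the centroid choice are invented for the example.
    import random
    from math import dist

    random.seed(1)
    points = [tuple(random.random() for _ in range(8)) for _ in range(1000)]

    # a handful of existing points stand in for k-means centroids
    centroids = random.sample(points, 10)
    buckets = {c: [] for c in centroids}
    for p in points:
        buckets[min(centroids, key=lambda c: dist(c, p))].append(p)

    def approx_knn(query, k=5):
        nearest_centroid = min(centroids, key=lambda c: dist(c, query))
        candidates = buckets[nearest_centroid]
        return sorted(candidates, key=lambda p: dist(p, query))[:k]

    query = tuple(random.random() for _ in range(8))
    for p in approx_knn(query):
        print(round(dist(p, query), 3))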

The review of recent work is helpful but when the paper reaches the algorithm for indexing “…dynamic multidimensional data…,” it slips away from me.

Where is the dynamic nature of the data that is being overcome by the indexing?

I ask because we human observers are untroubled by the curse of dimensionality, even when data is dynamically changing.

Although those are two important aspects when we process data by machine:

  • The number of dimensions of data, and
  • The rate at which the data is changing.

“What Makes Paris Look Like Paris?”

Filed under: Geo Analytics,Geographic Data,Image Processing,Image Recognition — Patrick Durusau @ 3:19 pm

“What Makes Paris Look Like Paris?” by Erwin Gianchandani.

From the post:

We all identify cities by certain attributes, such as building architecture, street signage, even the lamp posts and parking meters dotting the sidewalks. Now there’s a neat study by computer graphics researchers at Carnegie Mellon University — presented at SIGGRAPH 2012 earlier this month — that develops novel computational techniques to analyze imagery in Google Street View and identify what gives a city its character….

From the abstract:

Given a large repository of geotagged imagery, we seek to automatically find visual elements, e.g. windows, balconies, and street signs, that are most distinctive for a certain geo-spatial area, for example the city of Paris. This is a tremendously difficult task as the visual features distinguishing architectural elements of different places can be very subtle. In addition, we face a hard search problem: given all possible patches in all images, which of them are both frequently occurring and geographically informative? To address these issues, we propose to use a discriminative clustering approach able to take into account the weak geographic supervision. We show that geographically representative image elements can be discovered automatically from Google Street View imagery in a discriminative manner. We demonstrate that these elements are visually interpretable and perceptually geo-informative. The discovered visual elements can also support a variety of computational geography tasks, such as mapping architectural correspondences and influences within and across cities, finding representative elements at different geo-spatial scales, and geographically-informed image retrieval.

The video and other resources are worth the time to review/read.

What features do you rely on to “recognize” a city?

The potential to explore features within a city or between cities looks particularly promising.

Web Performance Power Tool: HTTP Archive (HAR)

Filed under: Interface Research/Design,Performance,Web Server — Patrick Durusau @ 2:52 pm

Web Performance Power Tool: HTTP Archive (HAR) by Ilya Grigorik.

From the post:

When it comes to analyzing page performance, the network waterfall tab of your favorite HTTP monitoring tool (e.g. Chrome Dev Tools, Firebug, Fiddler, etc) is arguably the single most useful power tool at our disposal. Now, wouldn’t it be nice if we could export the waterfall for better bug reports, performance monitoring, or later in-depth analysis?

Well, good news, that is precisely what the HTTP Archive (HAR) data format was created for. Even better, chances are, your favorite monitoring tool already knows how to speak in HAR, which opens up a lot of possibilities – let’s explore.

If you are tuning or developing a web interface, there is much here you will find helpful.

The gathering of information for later analysis, by other tools, was what interested me the most.
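
As a small example of that “later analysis by other tools,” here is a sketch that reads an exported HAR file (it is just JSON) and lists the slowest requests. The file name is an assumption; the fields follow the HAR log/entries structure.

    # Sketch: list the ten slowest requests in an exported HAR file.
    import json

    with open("example.har", encoding="utf-8") as f:
        har = json.load(f)

    entries = har["log"]["entries"]
    slowest = sorted(entries, key=lambda e: e["time"], reverse=True)[:10]

    for e in slowest:
        print(f'{e["time"]:8.1f} ms  {e["response"]["status"]}  {e["request"]["url"]}')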

WolframAlpha Launches Personal Analytics for Facebook

Filed under: Facebook,Graphs,WolframAlpha — Patrick Durusau @ 1:23 pm

WolframAlpha Launches Personal Analytics for Facebook by Kim Rees.

From the post:

WolframAlpha has launched its Personal Analytics for Facebook [wolframalpha.com] functionality. Simply type “facebook report” into the query box, authorize the app, and view the extensive analysis of your social network. The report shows you details about when you post, what types of things you post, the apps you use, who comments the most on your posts, your most popular images, and the structure of your friend network. You can easily share or embed sections of your report.

The report is incredibly detailed. You can drill down further into most sections. Any item of significance such as names and dates can be clicked to search for more information. It was interesting to find out that I was born under a waning crescent moon (is there anything Stephen Wolfram doesn’t know?!). I don’t use Facebook much, but this service makes Facebook fun again.

How would you contrast the ease of use factor of visual drill down with the ASCII art style of Cypher in Neo4j?

What user communities would prefer one over the other?

Book Review – “Universal Methods of Design”

Filed under: Design,Research Methods — Patrick Durusau @ 1:10 pm

Book Review – “Universal Methods of Design” by Cyd Harrell.

From the review:

I’ve never been one to use a lot of inspirational tools, like decks of design method cards. Day to day, I figure I have a very solid understanding of core practices and can make others up if I need to. But I’ve also been the leader of a fast-paced team that has been asked to solve all kinds of difficult problems through research and design, so sticking to my personal top five techniques was never an option. After all, only the most basic real-world research goals can be attained without combining and evolving methods.

So I was quite intrigued when I received a copy of Bella Martin and Bruce Hanington’s Universal Methods of Design, which presents summaries of 100 different research and analysis methods as two-page spreads in a nice, large-format hardback. Could this be the ideal reference for a busy research team with a lot of chewy problems to solve?

In short: yes. It functions as a great reference when we hear of a method none of us is familiar with, but more importantly it’s an excellent “unsticker” when we run into a challenge in the design or analysis of a study. I have a few quibbles with organization that I’ll get to in a minute, but in general this is a book that every research team should have on hand.

See the review for Cyd’s quibble.

For a copy near you, see: “Universal Methods of Design.”

Neo4j-[:LOVES]->Cypher

Filed under: Cypher,Graphs,Neo4j — Patrick Durusau @ 12:52 pm

Neo4j-[:LOVES]->Cypher

Follow-up to Michael Hunger’s presentation (30 August 2012) on Neo4j and Cypher.

Slides and video have been posted, plus answers to posted questions.

NEH Institute Working With Text In a Digital Age

Filed under: Humanities,Text Corpus,Text Encoding Initiative (TEI),Text Mining — Patrick Durusau @ 12:37 pm

NEH Institute Working With Text In a Digital Age

From the webpage:

The goal of this demo/sample code is to provide a platform which institute participants can use to complete an exercise to create a miniature digital edition. We will use these editions as concrete examples for discussion of decisions and issues to consider when creating digital editions from TEI XML, annotations and other related resources.

Some specific items for consideration and discussion through this exercise:

  • Creating identifiers for your texts.
  • Establishing markup guidelines and best practices.
  • Use of inline annotations versus standoff markup.
  • Dealing with overlapping hierarchies.
  • OAC (Open Annotation Collaboration)
  • Leveraging annotation tools.
  • Applying Linked Data concepts.
  • Distribution formats: optimizing for display vs. for enabling data reuse.

Excellent resource!

Offers a way to learn/test digital edition skills.

You can use it as a template to produce similar materials with texts of greater interest to you.

The act of encoding asks which subjects you are going to recognize and under what conditions. Good practice for topic map construction.

Not to mention that historical editions of a text have made similar, possibly differing, decisions about the same text.

Topic maps are a natural way to present such choices on their own merits, as well as being able to compare and contrast those choices.

I first saw this at The banquet of the digital scholars.

The banquet of the digital scholars

Filed under: Humanities,Text Corpus,Text Encoding Initiative (TEI),Text Mining — Patrick Durusau @ 10:32 am

The banquet of the digital scholars

The actual workshop title: Humanities Hackathon on editing Athenaeus and on the Reinvention of the Edition in a Digital Space


September 30, 2012 Registration Deadline

October 10-12, 2012
Universität Leipzig (ULEI) & Deutsches Archäologisches Institut (DAI) Berlin

Abstract:

The University of Leipzig will host a hackathon that addresses two basic tasks. On the one hand, we will focus upon the challenges of creating a digital edition for the Greek author Athenaeus, whose work cites more than a thousand earlier sources and is one of the major sources for lost works of Greek poetry and prose. At the same time, we use the case Athenaeus to develop our understanding of how to organize a truly born-digital edition, one that not only includes machine actionable citations and variant readings but also collations of multiple print editions, metrical analyses, named entity identification, linguistic features such as morphology, syntax, word sense, and co-reference analysis, and alignment between the Greek original and one or more later translations.

After some details:

Overview:
The Deipnosophists (Δειπνοσοφισταί, or “Banquet of the Sophists”) by Athenaeus of Naucratis is a 3rd century AD fictitious account of several banquet conversations on food, literature, and arts held in Rome by twenty-two learned men. This complex and fascinating work is not only an erudite and literary encyclopedia of a myriad of curiosities about classical antiquity, but also an invaluable collection of quotations and text re-uses of ancient authors, ranging from Homer to tragic and comic poets and lost historians. Since the large majority of the works cited by Athenaeus is nowadays lost, this compilation is a sort of reference tool for every scholar of Greek theater, poetry, historiography, botany, zoology, and many other topics.

Athenaeus’ work is a mine of thousands of quotations, but we still lack a comprehensive survey of its sources. The aim of this “humanities hackathon” is to provide a case study for drawing a spectrum of quoting habits of classical authors and their attitude to text reuse. Athenaeus, in fact, shapes a library of forgotten authors, which goes beyond the limits of a physical building and becomes an intellectual space of human knowledge. By doing so, he is both a witness of the Hellenistic bibliographical methods and a forerunner of the modern concept of hypertext, where sequential reading is substituted by hierarchical and logical connections among words and fragments of texts. Quantity, variety, and precision of Athenaeus’ citations make the Deipnosophists an excellent training ground for the development of a digital system of reference linking for primary sources. Athenaeus’ standard citation includes (a) the name of the author with additional information like ethnic origin and literary category, (b) the title of the work, and (c) the book number (e.g., Deipn. 2.71b). He often remembers the amount of papyrus scrolls of huge works (e.g., 6.229d-e; 6.249a), while distinguishing various editions of the same comedy (e.g., 1.29a; 4.171c; 6.247c; 7.299b; 9.367f) and different titles of the same work (e.g., 1.4e).

He also adds biographical information to identify homonymous authors and classify them according to literary genres, intellectual disciplines and schools (e.g., 1.13b; 6.234f; 9.387b). He provides chronological and historical indications to date authors (e.g., 10.453c; 13.599c), and he often copies the first lines of a work following a method that probably goes back to the Pinakes of Callimachus (e.g., 1.4e; 3.85f; 8.342d; 5.209f; 13.573f-574a).

Last but not least, the study of Athenaeus’ “citation system” is also a great methodological contribution to the domain of “fragmentary literature”, since one of the main concerns of this field is the relation between the fragment (quotation) and its context of transmission. Having this goal in mind, the textual analysis of the Deipnosophists will make possible to enumerate a series of recurring patterns, which include a wide typology of textual reproductions and linguistic features helpful to identify and classify hidden quotations of lost authors.

The 21st century has “big data” in the form of sensor streams and Twitter feeds, but “complex data” in the humanities pre-dates “big data” by a considerable margin.

If you are interested in being challenged by complexity and not simply the size of your data, take a closer look at this project.

Greek is a little late to be of interest to me but there are older texts that could benefit from a similar treatment.

BTW, while you are thinking about this project/text, consider how you would merge prior scholarship, digital and otherwise, with what originates here and what follows it in the decades to come.

HTML [Lessons in Semantic Interoperability – Part 2]

Filed under: HTML,Interoperability,Semantics,Web Server — Patrick Durusau @ 10:11 am

While writing Elli (Erlang Web Server) [Lessons in Semantic Interoperability – Part 1], I got distracted by the realization that web servers produce semantically interoperable content every day. Lots of it. For hundreds of millions of users.

My question: What makes the semantics of HTML different?

The first characteristic that came to mind was simplicity. Unlike some markup languages ;-), HTML did not have to await the creation of WYSIWYG editors to catch on. In part, I suspect, because after a few minutes with it most users (not all) could begin to author HTML documents.

Think about the last time you learned something new. What is the one thing that brings closure to the learning experience?

Feedback, knowing if your attempt at an answer is right or wrong. If right, you will attempt the same solution under similar circumstances in the future. If wrong, you will try again (hopefully).

When HTML appeared, so did primitive (in today’s terms) web browsers.

Any user learning HTML could get immediate feedback on their HTML authoring efforts.

Not:

  • After installing additional validation software
  • After debugging complex syntax or configurations
  • After millions of other users do the same thing
  • After new software appears to take advantage of it

Immediate feedback means just that: immediate feedback.

The second characteristic is immediate feedback.

You can argue that such feedback was an environmental factor and not a characteristic of HTML proper.

Possibly, possibly. But if such a distinction is possible and meaningful, how does it help with the design/implementation of the next successful semantic interoperability language?

I would argue by whatever means, any successful semantic interoperability language is going to include immediate feedback, however you classify it.

Elli (Erlang Web Server) [Lessons in Semantic Interoperability – Part 1]

Filed under: Erlang,Interoperability,Semantics,Web Server — Patrick Durusau @ 8:04 am

Elli

From the post:

My name is Knut, and I want to show you something really cool that I built to solve some problems we are facing here at Wooga.

Having several very successful social games means we have a large number of users. In a single game, they can generate around ten thousand HTTP requests per second to our backend systems. Building and operating the software required to service these games is a big challenge that sometimes requires creative solutions.

As developers at Wooga, we are responsible for the user experience. We want to make our games not only fun and enjoyable but accessible at all times. To do this we need to understand and control the software and hardware we rely on. When we see an area where we can improve the user experience, we go for it. Sometimes this means taking on ambitious projects. An example of this is Elli, a webserver which has become one of the key building blocks of our successful backends.

Having used many of the big Erlang webservers in production with great success, we still found ourselves thinking of how we could improve. We want a simple and robust core with no errors or edge cases causing problems. We need to measure the performance to help us optimize our network and user code. Most importantly, we need high performance and low CPU usage so our servers can spend their resources running our games.

I started this post about Elli to point out the advantages of having a custom web server application, if your needs aren’t met by one of the standard ones.

Something clicked and I realized that web servers, robust and fast as well as lame and slow, churn out semantically interoperable content every day.

For hundreds of millions of users.

Rather than starting from the perspective of the “semantic interoperability” we want, why not examine the “semantic interoperability” we have already, for clues on what may or may not work to increase it?

When I say “semantic interoperability” on the web, I am speaking of the interpretation of HTML markup, the <a>, <p>, <ol>, <ul>, <div> and <h1>-<h6> elements that make up most pages.

What characteristics do those markup elements share that might be useful in creating more semantic interoperability?

The first characteristic is simplicity.

You don’t need a lot of semantic overhead machinery or understanding to use any of them.

A plain text editor and knowledge that some text has a general presentation is enough.

It takes a few minutes for a user to learn enough HTML to produce meaningful (to them and others) results.
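
In the spirit of that few-minutes test, here is a throwaway sketch that writes a one-link page and opens it in your default browser, which is about all the tooling the exercise needs. The file name and link target are arbitrary.

    # Throwaway sketch: write a one-link HTML page and open it in the
    # default browser to see the result.
    import pathlib
    import webbrowser

    page = pathlib.Path("hello.html")
    page.write_text('<p>My first link: <a href="https://example.com/">example.com</a></p>\n',
                    encoding="utf-8")
    webbrowser.open(page.resolve().as_uri())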

At least in the case of HTML, that simplicity has led to a form of semantic interoperability.

HTML was defined with interoperable semantics but unadopted interoperable semantics are like no interoperable semantics at all.

If HTML has simplicity of semantics, what else does it have that led to widespread adoption?
