Text Analytics Tools and Runtime for IBM LanguageWare
From the website:
IBM LanguageWare is a technology which provides a full range of text analysis functions. It is used extensively throughout the IBM product suite and is successfully deployed in solutions which focus on mining facts from large repositories of text. With support for more than 20 languages, LanguageWare is the ideal solution for extracting the value locked up in unstructured text information and exposing it to business applications. With the emerging importance of Business Intelligence and the explosion in text-based information, the need to exploit this “hidden” information has never been so great. LanguageWare technology not only provides the functionality to address this need, it also makes it easier than ever to create, manage and deploy analysis engines and their resources.
It comprises Java libraries with a large set of features and the linguistic resources that supplement them. It also comprises an easy-to-use Eclipse-based development environment for building custom text analysis applications. In a few clicks, it is possible to create and deploy UIMA (Unstructured Information Management Architecture) annotators that perform everything from simple dictionary lookups to more sophisticated syntactic and semantic analysis of texts using dictionaries, rules and ontologies.
The LanguageWare libraries provide the following non-exhaustive list of features: dictionary look-up and fuzzy look-up, lexical analysis, language identification, spelling correction, hyphenation, normalization, part-of-speech disambiguation, syntactic parsing, semantic analysis, facts/entities extraction and relationship extraction. For more details see the documentation.
The LanguageWare Resource Workbench provides a complete development environment for the building and customization of dictionaries, rules, ontologies and associated UIMA annotators. This environment removes the need for specialist knowledge of the underlying technologies of natural language processing or UIMA. In doing so, it allows the user to focus on the concepts and relationships of interest, and to develop analyzers which extract them from text without having to write any code. The resulting application code is wrapped as UIMA annotators, which can be seamlessly plugged into any application that is UIMA-compliant.
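To make the "dictionary look-up and fuzzy look-up" features in the description above a little more concrete, here is a minimal sketch in plain Python. It does not use LanguageWare or UIMA, and it is not how either is implemented. The `Annotation` class, the `annotate` function, the sample dictionary and the 0.8 similarity cutoff are all assumptions made for illustration. Fuzzy matching is approximated with the standard library's `difflib`.

```python
import re
from dataclasses import dataclass
from difflib import get_close_matches


@dataclass
class Annotation:
    """Hypothetical stand-in for an annotation: a text span plus a label."""
    begin: int          # offset of the first character of the span
    end: int            # offset one past the last character of the span
    covered_text: str   # the text of the span as it appears in the input
    entry: str          # the dictionary entry that matched
    label: str          # the label stored for that entry
    exact: bool         # True for an exact hit, False for a fuzzy hit


def annotate(text, dictionary, cutoff=0.8):
    """Annotate each word that matches a dictionary entry.

    Exact (case-insensitive) matches are tried first. Otherwise the closest
    entry is accepted if its difflib similarity ratio is at least `cutoff`.
    """
    keys = list(dictionary)
    annotations = []
    for match in re.finditer(r"\w+", text):
        token = match.group().lower()
        if token in dictionary:
            entry, exact = token, True
        else:
            close = get_close_matches(token, keys, n=1, cutoff=cutoff)
            if not close:
                continue  # no entry is similar enough; leave the word alone
            entry, exact = close[0], False
        annotations.append(
            Annotation(match.start(), match.end(), match.group(),
                       entry, dictionary[entry], exact)
        )
    return annotations


if __name__ == "__main__":
    # Illustrative dictionary; the labels are made up for this sketch.
    sample = {"watson": "System", "jeopardy": "Show"}
    for ann in annotate("Watson played Jeopordy", sample):
        print(ann)
```

A real annotator would also have to handle multi-word entries, inflected forms and language-specific tokenization, which is the sort of work the description above attributes to the LanguageWare libraries and resources.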
IBM has attracted a lot of attention with its Jeopardy-playing “Watson,” and that isn’t necessarily a bad thing.
Personally, I am hopeful that it will spur greater interest in both the humanities and CS: the humanities because CS without them lacks a lot of interesting problems, and CS because that interest can result in software for the rest of us to use.
Many years ago, before CS became professional, or at least as professional as it is now, computer science projects had a healthy mixture of mathematicians, engineers, humanists and people who would become computer scientists.
This software package may be a good way to attract a better cross-section of people to a project.
I am not sure whether finding collaborators will be easier in a university setting (with its sharp department lines) or in a public setting, where people may be looking for public-interest projects outside of work.
Possible project questions:
- Define a project where you would use these text analytics tools. (3-5 pages, no citations)
- What other disciplines would you involve and how would you persuade them to participate? (3-5 pages, no citations)
- How would you involve topic maps in your project and why? (3-5 pages, no citations)
- How would you use these tools to populate your topic maps? (5-7 pages, no citations)