Text Mining: The Maverick Tools
Introduction
Traditionally, business intelligence has focused on analyzing data gathered from transaction processing systems, such as enterprise resource planning (ERP), customer relationship management (CRM), sales force automation (SFA), claims processing, and other structured data sources. Yet a huge amount of unstructured data goes unused in decision making. Such data is abundant in most organizations but to date has not been tapped as a source of business intelligence. Free-form text, audio, and video are the most common forms of unstructured data. Recent advances in computational linguistics, as well as in Web and enterprise search, make the integration of unstructured data into a business intelligence infrastructure feasible and effective. Together, these advances are broadly referred to as text mining, which is defined as the analysis of natural language text to extract key terms, entities, and the relationships between those terms and entities.
Core Techniques
Three core text mining techniques are most commonly used for such analysis:
• Term extraction;
• Information extraction; and
• Link analysis.
Term extraction is the most basic form of text mining. Like all text mining techniques, it maps information from unstructured data to a structured format. It identifies key terms and logical entities, such as the names of organizations, locations, dates, and monetary amounts. The next level of complexity in text mining is information extraction. Unlike term extraction, which focuses on individual terms, information extraction focuses on a set of facts that constitute an event, episode, or state. It builds on the terms extracted from text to identify basic relationships between them. Link analysis is a set of techniques for gaining insight into the relationships between multiple entities that have multiple connections, steps, or links. It combines multiple relationships to form multistep models of complex processes. Together, these three techniques provide the foundation for integrating text-based business intelligence into existing BI systems.
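To make the progression from terms to facts to linked relationships concrete, the following is a minimal Python sketch of how the three techniques could build on one another. It is an illustration under stated assumptions, not a production approach: the sample sentences, company names, and amounts are invented, the regular expressions stand in for the far richer linguistic analysis a real text mining tool would use, and the function names (extract_terms, extract_events, build_graph, find_chain) are hypothetical.

```python
import re
from collections import defaultdict, deque

# Hypothetical sample text; companies, dates, and amounts are invented.
DOCS = [
    "Acme Corp acquired Widget Inc for $12 million on 2004-03-15.",
    "Widget Inc acquired Gizmo Ltd for $3 million on 2003-07-01.",
]

# Simplified patterns (an assumption of this sketch, not a general solution).
ORG = r"(?:[A-Z][a-z]+ )+(?:Corp|Inc|Ltd)\b"
DATE = r"\d{4}-\d{2}-\d{2}"
MONEY = r"\$\d+(?:\.\d+)? (?:million|billion)"

TERM_PATTERNS = {
    "ORG": re.compile(ORG),
    "DATE": re.compile(DATE),
    "MONEY": re.compile(MONEY),
}


def extract_terms(text):
    """Term extraction: map free text to a list of (type, term) pairs."""
    return [
        (label, match.group(0))
        for label, pattern in TERM_PATTERNS.items()
        for match in pattern.finditer(text)
    ]


# Information extraction: combine terms into a structured event record.
ACQUISITION = re.compile(
    rf"(?P<buyer>{ORG}) acquired (?P<target>{ORG}) "
    rf"for (?P<price>{MONEY}) on (?P<date>{DATE})"
)


def extract_events(text):
    """Return one dict per acquisition event found in the text."""
    return [match.groupdict() for match in ACQUISITION.finditer(text)]


def build_graph(events):
    """Link analysis, step 1: store each buyer -> target relationship."""
    graph = defaultdict(list)
    for event in events:
        graph[event["buyer"]].append(event["target"])
    return graph


def find_chain(graph, start, goal):
    """Link analysis, step 2: breadth-first search for a multistep path."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


events = [event for doc in DOCS for event in extract_events(doc)]
print(extract_terms(DOCS[0]))
print(events[0])
print(find_chain(build_graph(events), "Acme Corp", "Gizmo Ltd"))
```

On the sample sentences, the final call returns the chain Acme Corp, Widget Inc, Gizmo Ltd. Neither sentence states this relationship on its own. It emerges only when the two extracted events are linked, which is the kind of multistep insight link analysis is meant to provide.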