Recording of Content through Automatic Indexing
In today's growing flood of digital information, it is becoming more and more difficult for users to find their way around ever larger volumes of heterogeneous data (big data), or to locate specific information in them.
The fast pace of the modern world is also increasing the time pressure on high-quality, intellectual research. At the same time, media companies (such as publishers, news agencies, online portals, public broadcasters and private television companies) face growing technical requirements when it comes to finding high-quality content in audio, video and text data that stands out from the masses.
The dio:semantic Categorizer indexes documents of any language and taxonomy. The combination of statistical methods with core linguistic principles guarantees very high cataloging quality.
The dio:semantic Categorizer is not limited to one specific taxonomy; it supports any number of hierarchical category systems. In addition, multiple models can be managed and controlled simultaneously, which makes it possible to index documents in parallel from different documentary perspectives. Through the synthesis of statistical and linguistic methods, the Categorizer achieves a throughput of up to 200 documents per second (each the length of an average press article) together with outstanding cataloging quality. The high quality of the procedure was first documented in the “Media Rank” Fraunhofer test and was confirmed through use in the ARD/PAN environment in 2011.
Entities (for example persons, organizations, geographic terms,…) are identified in documents by means of entity recognition. Together with clustering, this provides considerable support for fast research.
Named Entity Recognition (NER) supplies all the entities that appear in a document. Using an easy-to-understand set of rules that can be parameterized, NER can differentiate between central, relevant entities and those that are mentioned only in passing. In addition, a refined filtering mechanism makes it possible to resolve ambiguous entities (disambiguation).
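How such a parameterized rule might separate central entities from those mentioned in passing can be sketched as follows. This is only an illustration and not the dio:semantic rule set: the function name `split_entities`, the parameters `min_mentions` and `lead_chars`, and the sample mentions are all assumptions, and the entity mentions are taken as already detected by an upstream NER step.

```python
from collections import defaultdict

def split_entities(mentions, min_mentions=2, lead_chars=200):
    """Split entities into (central, casual) lists.

    mentions: (entity_name, char_offset) pairs from an upstream NER step.
    Hypothetical rule: an entity is central if it is mentioned at least
    `min_mentions` times or first appears within the first `lead_chars`
    characters of the document.
    """
    offsets = defaultdict(list)
    for name, offset in mentions:
        offsets[name].append(offset)

    central, casual = [], []
    for name, positions in offsets.items():
        if len(positions) >= min_mentions or min(positions) < lead_chars:
            central.append(name)
        else:
            casual.append(name)
    return sorted(central), sorted(casual)

# Made-up mentions: name plus character offset in the document.
mentions = [
    ("Jane Doe", 12),
    ("Berlin", 40),
    ("Jane Doe", 310),
    ("Example Corp", 950),
]
central, casual = split_entities(mentions)
print(central, casual)
```

Raising `min_mentions` or lowering `lead_chars` makes the rule stricter, which is the kind of tuning a parameterizable rule set allows.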
Automatic language recognition identifies the most important European languages in the document in question. If the document contains passages with different languages, then all languages can be recognized.
Based on a statistical knowledge base, the automatic language identification reliably recognizes German, English, French and Spanish. The language is recognized accurately even in text with many foreign words or very specific terminology. The underlying statistical approach makes it possible to expand language recognition to as many as 180 languages with very little effort.
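The source does not say which statistical model is used. As a rough illustration of the general idea, the following sketch compares character trigram counts of a text against per-language profiles using cosine similarity. The tiny sample texts, the `SAMPLES` table and the function names are assumptions made for this example; a real knowledge base would be trained on far more data.

```python
import math
from collections import Counter

# Tiny, made-up training samples; a real profile needs much more text.
SAMPLES = {
    "de": "der schnelle braune fuchs springt über den faulen hund und das ist "
          "die art wie wir auf deutsch mit den wörtern der sprache schreiben",
    "en": "the quick brown fox jumps over the lazy dog and this is the way "
          "that we write in english with the words of the language",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et ceci est "
          "la façon dont nous écrivons en français avec les mots de la langue",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y esta es la "
          "forma en que escribimos en español con las palabras de la lengua",
}

def trigrams(text):
    """Count overlapping character trigrams, padding with spaces."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

PROFILES = {lang: trigrams(text) for lang, text in SAMPLES.items()}

def identify_language(text):
    """Return the language code whose profile is most similar to the text."""
    query = trigrams(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))

print(identify_language("das ist die sprache die wir schreiben"))
```

In a design like this, supporting another language only requires adding one more profile, which is consistent with the low expansion effort described above.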
Related Topics and Duplicates
The semantic system provides comparative values between one document and another by analyzing thematic overlaps or commonalities in the wording.
The similarities index contains information on the thematic relationship between the documents. Related topics include text that is formulated in different ways, but deals with the same issues. The duplicates index contains information on structural commonalities between the documents and is used for finding duplicates.
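The difference between the two indexes can be illustrated with two common measures. This is a sketch under assumptions, not the actual dio:semantic indexes: thematic similarity is approximated here by cosine similarity over word counts, and structural similarity by Jaccard overlap of three-word shingles. The function names and sample sentences are invented for the example.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lowercase word tokens."""
    return re.findall(r"\w+", text.lower())

def topic_similarity(a, b):
    """Cosine similarity of word counts: ignores word order."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
           math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def shingles(text, size=3):
    """Set of overlapping word sequences of the given size."""
    words = tokens(text)
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def duplicate_similarity(a, b, size=3):
    """Jaccard overlap of word shingles: sensitive to wording and order."""
    sa, sb = shingles(a, size), shingles(b, size)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

original = "the central bank raised interest rates on thursday"
copy = "The central bank raised interest rates on Thursday."
rewrite = "on thursday interest rates were raised by the central bank"

print(topic_similarity(original, rewrite))      # high: same topic
print(duplicate_similarity(original, rewrite))  # low: different wording
print(duplicate_similarity(original, copy))     # 1.0: structural duplicate
```

The reworded sentence scores high on the topic measure but low on the shingle measure, while the near-verbatim copy scores 1.0 on the shingle measure, which mirrors the split between related topics and duplicates.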
All documents can be grouped into clusters completely automatically. Thematic structuring makes it possible to collect and screen large volumes of digital content.
The dio:semantic clustering procedure structures documents thematically. The documents can be drawn from different sources. Each cluster represents a group of documents that share a certain amount of thematic overlap (similarity). This overlap in content can also be viewed as the main topic of a cluster. Volumes of documents preprocessed in this way can be collected and catalogued more easily, more accurately and faster.
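The clustering algorithm itself is not described in the source. One simple way to group documents by thematic overlap is a single-pass threshold clustering, sketched below. The `threshold` value, the function names and the sample documents are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

def vector(text):
    """Bag-of-words count vector for a document."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(documents, threshold=0.3):
    """Greedy single-pass clustering.

    Each document joins the first cluster whose seed (first member) is at
    least `threshold` similar; otherwise it starts a new cluster.
    Returns clusters as lists of document indices.
    """
    vectors = [vector(d) for d in documents]
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

documents = [
    "stock market shares fall",
    "shares fall on stock market fears",
    "football team wins cup final",
    "cup final won by football team",
]
clusters = cluster(documents)
print(clusters)
```

A higher `threshold` yields more, tighter clusters; a lower one merges loosely related documents into broader topics.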
The dio:semantic Summarizer creates text summaries fully automatically with the relevant key statements of a document reduced to a predefined amount of text.
Abstracts at the Press of a Button
With the semantic system, you can shorten any text to a predefined length and still retain the most important information. For example, you can make a one-page management abstract out of an extensive press release at the press of a button, or you can make a fact sheet out of technical documentation.
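Shortening a text to a predefined length while keeping its key statements can be approximated with a basic extractive approach. The sketch below is an assumption, not the Summarizer's actual method: it scores each sentence by the average frequency of its words and keeps the best-scoring sentences that fit into a character budget. The stopword list, the function name `summarize` and the sample text are invented for the example.

```python
import re
from collections import Counter

# Minimal, made-up stopword list for the example.
STOPWORDS = {"the", "a", "an", "was", "is", "for", "at", "of", "and", "to", "in"}

def content_words(sentence):
    return [w for w in re.findall(r"\w+", sentence.lower()) if w not in STOPWORDS]

def summarize(text, max_chars):
    """Pick the highest-scoring sentences that fit within max_chars."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen, used = [], 0
    for i in ranked:
        cost = len(sentences[i]) + (1 if chosen else 0)  # +1 for the joining space
        if used + cost <= max_chars:
            chosen.append(i)
            used += cost
    # Emit the selected sentences in their original order.
    return " ".join(sentences[i] for i in sorted(chosen))

text = ("The new engine cuts fuel use. The engine was tested for a year. "
        "Lunch was served at noon.")
summary = summarize(text, max_chars=70)
print(summary)
```

The off-topic sentence about lunch is dropped because its words occur nowhere else in the text and it no longer fits the budget.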
Keywords and Tag Cloud
dio:semantic determines the central keywords of a document and thereby provides the perfect prerequisite for modern tag cloud visualization and automatic indexing.
dio:semantic Keyword Extraction determines the relevant, meaningful terms of a text, the so-called keywords. They describe a single document or a set of documents in the most compact form.
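A common baseline for finding such keywords is TF-IDF weighting, which favors terms that are frequent in one document but rare in the rest of a collection. The sketch below uses that baseline as a stand-in; the source does not state which method dio:semantic uses, and the function name, the background documents and the sample text are assumptions.

```python
import math
import re
from collections import Counter

def words(text):
    return re.findall(r"\w+", text.lower())

def keywords(document, background, top_n=5):
    """Rank the document's terms by TF-IDF against a background collection."""
    corpus = [document] + list(background)
    doc_freq = Counter()
    for text in corpus:
        doc_freq.update(set(words(text)))

    term_freq = Counter(words(document))
    scores = {
        term: count * math.log(len(corpus) / doc_freq[term])
        for term, count in term_freq.items()
    }
    # Drop terms that occur in every document (score 0), then sort by
    # descending score with alphabetical order as a tie-breaker.
    ranked = sorted((t for t in scores if scores[t] > 0),
                    key=lambda t: (-scores[t], t))
    return ranked[:top_n]

document = "solar panels convert sunlight into power and solar farms scale the panels"
background = [
    "the bank raised rates and the market fell",
    "the team won the match and the fans cheered",
]
top_terms = keywords(document, background)
print(top_terms)
```

The resulting scores could also drive the font sizes in a tag cloud, with higher-scoring terms rendered larger.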
Analysis in Video & Audio Data
Through its partnership with EML, the experts in the field of Speech-2-Text (audio transcription), dio:semantic can extend its semantic repertoire to video and audio data and offer its customers additional information about this content as well.
Conventional advertising systems are highly dependent on existing metadata and its quality. Our solution can directly record the plain text of any media content and is therefore completely independent of existing metadata.
On the basis of our innovative audio and video analysis system, we can identify relevant information in Web videos, podcasts or WebTV fully automatically. The metadata obtained in this way can be used in your company in a variety of ways:
- for context-based advertising control, synchronous to the playback of the content
- for precision analysis of podcasts or Web videos
- for enhancement of existing user-generated metadata
dio:semantic completes the semantic repertoire with a search that accurately finds all previously analyzed and indexed data that has been enhanced with metadata.
The intelligent search strategies deliver relevant hits across all media types and present them in a clearly laid out Web user interface. The intuitive search interface invites the user to conduct targeted research in large volumes of data and makes it possible to refine the search results with just a few mouse clicks.
On these pages, you will find out what performance and individual adaptations are possible with the semantic system. If you would like additional explanations of these current and real implementations, then please send us a message.
The dio:semantic Training Client completes the semantic portfolio in the area of data maintenance. With this desktop application, the knowledge base can be conveniently managed, maintained and optimized.
The Training Client is a platform-independent Java desktop application. It provides all the functionality for convenient management, maintenance and optimization of the statistical knowledge bases. Comprehensive statistical and analytical procedures provide in-depth insight into how the dio:semantic Categorizer works.