MEERKAT Digital Documents Reverse Indexing and Semantic Search in Big Data Repositories of UBITECH LTD
Price of the Product
Characteristics of the Product
- Type: Software
Specifically, the platform handles the automatic or semi-automatic processing of structured or unstructured digital documents and files, a process that comprises three stages:
Stage 1 – Data and metadata extraction: Digital files are processed in their initial format in order to extract the essential text and the metadata of each document. The essential text is the plain text that can be extracted from any structured or unstructured digital file. Additional digital data in other formats, such as sound, images and comment tags, is outside the scope of the MEERKAT system and is therefore not processed. The metadata are the plain-text elements attached to or embedded in a digital file. The metadata extracted during this first stage form a closed list that includes every extracted metadata field together with its respective value.
Stage 2 – Named-Entity Recognition (NER): During this stage, named-entity recognition of semantic entities takes place on the essential text extracted during the first stage. The task is carried out by two independent named-entity extraction subsystems of the primary NER system, which are described below:
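As an illustration of the Stage 1 output described above, the following is a minimal Python sketch, not the MEERKAT implementation. The field names in `ALLOWED_METADATA_FIELDS`, the `ExtractedDocument` class and the `extract` function are hypothetical; the source does not list the actual closed list of metadata fields or the parsers used.

```python
from dataclasses import dataclass, field

# Hypothetical closed list of metadata fields (assumption: the real list is not given).
ALLOWED_METADATA_FIELDS = ("title", "author", "created", "language")


@dataclass
class ExtractedDocument:
    """Result of Stage 1: plain essential text plus closed-list metadata."""
    essential_text: str
    metadata: dict = field(default_factory=dict)


def extract(raw_text: str, raw_metadata: dict) -> ExtractedDocument:
    # Keep only the fields on the closed list, each paired with its value as plain text.
    metadata = {
        name: str(raw_metadata[name])
        for name in ALLOWED_METADATA_FIELDS
        if name in raw_metadata
    }
    # Collapse whitespace so the essential text is simple plain text.
    essential_text = " ".join(raw_text.split())
    return ExtractedDocument(essential_text, metadata)
```

For example, `extract("  Annual \n report ", {"author": "A. Smith", "thumbnail": "x"})` keeps the `author` field and drops `thumbnail`, because only fields on the closed list are retained.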
Named-entity recognition based on statistical rules: This subsystem extracts named entities from the essential text using statistical rules and pre-trained probabilistic entity models. Based on these models, it recognizes and categorizes the entities identified within the essential text. The models and NER procedures are built on the maximum entropy algorithm, which is widely used in natural language processing and is regarded as accurate and well suited to tasks such as part-of-speech (POS) tagging, sentence detection, relationship extraction and sentiment analysis.
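To make the maximum entropy idea concrete, here is a toy Python sketch of how such a model scores labels for a token: each label receives the sum of the weights of the active features, and the scores are normalised with an exponential (softmax). The weights, feature names and labels below are hand-set assumptions for illustration; a real model learns its weights from annotated training data, and nothing here reflects the actual MEERKAT models.

```python
import math

# Hypothetical hand-set weights; a trained model would learn these from data.
WEIGHTS = {
    "PERSON": {"capitalized": 1.0, "prev=mr": 2.0},
    "LOCATION": {"capitalized": 1.0, "prev=in": 2.0},
    "O": {"lowercase": 2.0},  # "O" marks tokens that are not entities
}


def features(tokens, i):
    """Binary features describing the token at position i."""
    feats = ["capitalized" if tokens[i][:1].isupper() else "lowercase"]
    if i > 0:
        feats.append("prev=" + tokens[i - 1].lower())
    return feats


def label_probabilities(feats):
    """Maximum entropy form: p(label | feats) = exp(score(label)) / Z."""
    scores = {
        label: sum(weights.get(f, 0.0) for f in feats)
        for label, weights in WEIGHTS.items()
    }
    z = sum(math.exp(s) for s in scores.values())
    return {label: math.exp(s) / z for label, s in scores.items()}


def tag(tokens):
    """Assign each token its most probable label."""
    labels = []
    for i in range(len(tokens)):
        probs = label_probabilities(features(tokens, i))
        labels.append(max(probs, key=probs.get))
    return labels
```

With these toy weights, `tag(["the", "officer", "met", "mr", "Smith", "in", "Athens"])` labels "Smith" as `PERSON` and "Athens" as `LOCATION`, and every other token as `O`.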
Named-entity recognition based on vocabulary: Unlike the previous subsystem, this one relies neither on pre-trained entity models nor on machine learning algorithms. It performs named-entity recognition by tokenizing the essential text into words and passing each word through a filter that extracts its stem. A customized analyzer then attempts an exact match between each stemmed word and the words contained in the vocabularies supplied to the system.
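The tokenize, stem and exact-match steps can be sketched as follows. This is a simplified Python illustration under stated assumptions: the suffix-stripping `stem` function is a deliberately naive stand-in for a real stemmer, and the `VOCABULARIES` contents and entity type names are hypothetical.

```python
import re

# Naive suffix list (assumption): a production system would use a proper stemmer.
SUFFIXES = ("ing", "es", "s")

# Hypothetical vocabularies mapping an entity type to its known words.
VOCABULARIES = {
    "VEHICLE": ["car", "truck"],
    "DOCUMENT": ["passport", "permit"],
}


def stem(word):
    """Lowercase the word and strip one known suffix, keeping at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Stemmed vocabulary word -> entity type, built once for exact-match lookups.
LOOKUP = {
    stem(word): entity_type
    for entity_type, words in VOCABULARIES.items()
    for word in words
}


def recognize(text):
    """Return (token, entity type) pairs whose stem exactly matches a vocabulary entry."""
    tokens = re.findall(r"\w+", text)
    return [(tok, LOOKUP[stem(tok)]) for tok in tokens if stem(tok) in LOOKUP]
```

For example, `recognize("Two trucks and three passports were found")` returns `[("trucks", "VEHICLE"), ("passports", "DOCUMENT")]`, since both plural forms stem to words present in the vocabularies.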
Stage 3 – Data Indexing: This final stage covers the processing, analysis and storage of the data in its final format. More specifically, all the essential information extracted from the original data source is stored in a persistent location in a specific structured data format, together with all the indexes defined during this process. Once this stage is complete, the data is in a form on which subject indexing can be performed.
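The reverse (inverted) indexing named in the product title can be illustrated with a minimal in-memory Python sketch. It is an assumption-laden simplification: the function names are hypothetical, and the persistent storage, analysis chain and index structures that MEERKAT actually uses are not described in the source.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every term of the query."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)
```

Given `{1: "stolen car report", 2: "car registration", 3: "stolen passport"}`, searching for "stolen car" returns `{1}`, because only document 1 appears in the postings of both terms.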
Supported Search Operations and Semantic Queries
Once the aforementioned technical implementation and deployment stages have been completed, the end-user can run a set of searches and semantic queries on the processed digital data through the dedicated user interface of the semantically enriched search engine that MEERKAT provides. Specifically, ubi:indexer offers the end-user the following operations:
- Simple (Basic) Search: A search based only on the end-user’s input, which is usually raw text.
- Faceted Search: Lets the end-user browse preferred topics, with the data categorized according to the indexes that define them.
- Advanced Search: The results are pre-processed according to the user’s preferences. In more detail, the end-user can filter the search results by criteria of interest. Indicative criteria include ascending/descending sorting, range filtering on a result set, field filtering, filtering by statistical or complex mathematical functions, semantic distance between words, and grouping by max/min/mean value.
- Complex Search: A combination of the aforementioned search types.
- Result Highlighting
- Spelling Check
- Multi-Format Result Exporting (XML/XSLT, JSON, Python, Ruby, PHP, Velocity, CSV, binary)
- End-users’ Permission Search Control
- Spelling auto-suggestion functionality on users’ queries
- Auto-complete functionality on users’ queries
- History Search Record
- Automatic Save of recent search results
- Performance Optimization
- Geospatial Search
- Adaptive Similarity Model per search field
- Adaptation and Modification of Special and Lexicon lists (synonym lists, protected list, named-entity lists, stop-word lists)
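A few of the operations listed above (simple search, faceted search, and advanced search with range filtering and sorting) can be sketched in Python as follows. The records, field names and function names are hypothetical and serve only to show how the operations relate to one another; they are not the ubi:indexer API.

```python
from collections import Counter

# Hypothetical indexed records; the field names are illustrative only.
RECORDS = [
    {"id": 1, "type": "report", "year": 2011, "text": "stolen car report"},
    {"id": 2, "type": "form", "year": 2013, "text": "car registration form"},
    {"id": 3, "type": "report", "year": 2014, "text": "stolen passport report"},
]


def simple_search(records, term):
    """Simple search: keep records whose text contains the raw term as a word."""
    term = term.lower()
    return [r for r in records if term in r["text"].lower().split()]


def facet_counts(records, field):
    """Faceted search: count how many results fall under each value of a field."""
    return Counter(r[field] for r in records)


def advanced_search(records, term, year_range=None, sort_by=None, descending=False):
    """Advanced search: simple search plus optional range filtering and sorting."""
    hits = simple_search(records, term)
    if year_range is not None:
        low, high = year_range
        hits = [r for r in hits if low <= r["year"] <= high]
    if sort_by is not None:
        hits = sorted(hits, key=lambda r: r[sort_by], reverse=descending)
    return hits
```

For instance, `advanced_search(RECORDS, "stolen", year_range=(2010, 2014), sort_by="year", descending=True)` returns records 3 and 1 in that order, while `facet_counts(simple_search(RECORDS, "car"), "type")` counts one `report` and one `form`.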
The MEERKAT solution was adopted for the implementation of the "Administrative and Juridical Digital Services of the Greek Police" project, funded by the Operational Programme "Digital Convergence" of the Ministry of Infrastructure, Transport and Network of the Greek Republic, in which UBITECH was assigned responsibility for implementing the reverse indexing, semantic enrichment and complex search utilities of the Greek Police Big Digital Archive. In particular, UBITECH has built a multilingual, scalable, linguistic and semantic-based search engine for the Greek Police that offers reverse indexing, semantic enrichment and semantic search services over large amounts of unstructured, semi-structured or fully structured digital documents and data of the Greek Police (more than five million files, resulting in more than fifty million digital documents and more than 140 TB of data). By performing semantic analysis with the help of knowledge models and text mining techniques, the subsystem enriches the indexed digital documents and the queries to be executed with semantic information derived from ontological models developed specifically for the Greek Police.
Category of the Product
Description of the Product
MEERKAT handles morphological variations, synonyms, context awareness, generalizations, concept matching, semantic matching and natural language queries, and allows end-users to enter their questions freely, without special formats or operators.
MEERKAT provides end-users with the tools needed to complete a set of tasks based on natural language processing technologies, with the ultimate goal of running semantic and complex searches, as well as expressive queries, on large datasets. These tasks are fundamental to the development of advanced text processing services.