Efficient continuous multi-query processing over data streams

Ζερβάκης, Ελευθέριος

dc.contributor.advisor	Τρυφωνόπουλος, Χρήστος
dc.contributor.author	Ζερβάκης, Ελευθέριος
dc.date.accessioned	2020-05-11T06:26:11Z
dc.date.issued	2019-07-10
dc.identifier.uri	http://amitos.library.uop.gr/xmlui/handle/123456789/5559
dc.description	Δ.Δ. 9	el
dc.description.abstract	Στη σύγχρονη ψηφιακή εποχή, η δημιουργία και η διάθεση νέας πληροφορίας γίνεται με ταχείς ρυθμούς. Η επιλεκτική διάχυση πληροφορίας (information dissemination, publish/subscribe) έχει αναπτυχθεί ως το μέσο για την διευκόλυνση της αναζήτησης και έγκαιρης διάδοσης πληροφορίας στους χρήστες, καθώς και της ανακάλυψης νέου και ενδιαφέροντος περιεχομένου. Τα τελευταία χρόνια, η επιστημονική έρευνα στον τομέα της διάχυσης πληροφορίας έχει επικεντρωθεί στην αναπαράσταση των ενδιαφερόντων των χρηστών που εκφράζο-νται μέσω της δημιουργίας προφίλ (π.χ., εγγραφές σε υπηρεσίες παροχής ειδήσεων, δημιουργία προφίλ σε κοινωνικά δίκτυα κ.λ.π.) και στην αποτελεσματική και γρήγορη διανομή της πληροφορίας στους χρήστες, όταν αυτή γίνει διαθέσιμη. Ο τεράστιος όγκος δεδομένων όμως που γίνεται διαθέσιμος καθημερινά στον Παγκόσμιο Ιστό απαιτεί αποτελεσματικούς αλγόριθμους τόσο για την αναπαράσταση και ευρετηρίαση των προφίλ (profile creation, profile indexing), όσο και για το φιλτράρισμα της νέας διαθέσιμης πληροφορίας (publication filtering, information dissemination, mutli-query processing). Η παρούσα διατριβή στοχεύει στην επίλυση των παραπάνω προ-βλημάτων χρησιμοποιώντας σύγχρονες μορφές αναπαράστασης δεδομένων (RDF data, graph data), και προτείνοντας δομές δεδομένων και αλγόριθμους για την διαχείριση του μεγάλου όγκου πληροφορίας. Η παρούσα έρευνα μελέτησε λύσεις ευρετηρίασης και φιλτραρίσματος πληροφορίας κειμένου βασισμένες σε δεντρικές δομές (trie-based profile indexing), σχεδίασε και ανέπτυξε αλγορίθμους για την ευρετηρίαση δεδομένων μεγάλου όγκου που έχουν ληφθεί από μια πληθώρα συλλογών κειμένων. Οι προτεινόμενοι αλγόριθμοι αξιο-λογήθηκαν πειραματικά και τα αποτελέσματα που προκύπτουν από την αξιολόγηση υποδεικνύουν βελτίωση έως και δυο τάξεις μεγέθους σε σύγκριση με υπάρχουσες λύσεις της βιβλιογραφίας. Τα αποτελέσματα της έρευνας μας επισημαίνουν ως καίριο αράγοντα βελτιστοποίησης της αποτελεσματικής απόδοσης του φιλτραρίσματος τις δεντρικές δομές. Πιο συγκεκριμένα, τα αποτελέσματα υποδεικνύουν ότι η μορφολογία και οργάνωση των δεντρικών δομών είναι ο καθοριστικός παράγοντας βελτιστοποίη-σης, σε αντίθεση με την μέχρι έως τώρα πεποίθηση ότι το μέγεθος των δεντρικών δομών (forest compactness) αποτελεί τον κύριο παράγοντα απόδοσης. Σε συνέχεια της παρούσας έρευνας, σχεδιάσθηκαν και αναπτύχθηκαν αλγορίθμοι για την ευρετηρίαση και το φιλτράρισμα δεδομένων που αναπαριστώνται στο μοντέλο δεδομένων RDF. Επιπρόσθετα, προτείναμε μια καινοτόμα επέκταση της γλώσσας ερωτήσεων SPARQL, η οποία στοχεύει στην αύξηση της εκφραστικότητας των ε-ρωτήσεων των χρηστών μέσω της παροχής τελεστών κειμένου (full-text operators). Οι αλγόριθμοι που σχεδιάστηκαν και αναπτύχθηκαν αξιολογήθηκαν πειραματικά, και τα αποτελέσματα που προκύπτουν από την αξιολόγηση υποδεικνύουν βελτίωση έως και δύο τάξεις μεγέθους σε σύγκριση με υπάρχουσες καινοτόμες λύσεις της βιβλιογραφίας. Επιπλέον, η έρευνα μας στόχευσε στη σχεδίαση και ανάπτυξη αλγορίθμων για την ευρετηρίαση και την αξιολόγηση ερωτήσεων σε ροές δεδομένων για γράφους. Η παρούσα έρευνα είναι η πρώτη στη βιβλιογραφία η οποία εισάγει την συνεχή αξιολόγη-ση πολλαπλών ερωτήσεων (mutli-query processing) πάνω από ροές δεδομένων για γράφους. Πιο συγκεκριμένα, σχεδιάσαμε και αναπτύξαμε τέσσερις νέους αλγορίθμους με σκοπό την μελέτη και αξιολόγηση της απόδοσης διαφορετικών προσεγγίσεων ευρετηρίασης προφίλ. Η αξιολόγηση στόχευσε στην εκτίμηση της απόδοσης των αλγορίθμων σε ένα ευρύ πεδίο εφαρμογών, όπως τα κοινωνικά δίκτυα (Social Net-works), τα δίκτυα κίνησης οχημάτων σε αστικά κέντρα (Road Networks), και οι γράφοι αλληλεπιδράσεων πρωτεϊνών (Protein-to-Protein Interaction Graphs), και στην αξιολόγηση και στην σύγκριση των σχεδιασθέντων αλγορίθμων με υπάρχουσες εμπορικές λύσεις. Τα αποτελέσματα της πειραματικής αξιολόγησης τονίζουν την ανάγκη για ανάπτυξη εξιδεικευμένων λύσεων σχεδιασμένων για συνεχή αξιολόγηση ερωτήσεων σε ροές δεδομένων γράφων, καθώς παρατηρήθηκε βελτίωση του χρόνου φιλτραρίσματος κατά δυο τάξεις μεγέθους ανάμεσα στους προτεινόμενους αλγόριθ-μους και στις πιο απλοϊκές προσεγγίσεις. Τέλος, η έρευνα μας επικεντρώθηκε στην σχεδίαση και ανάπτυξη ενός καινοτόμου, πλήρως λειτουργικού, συστήματος φιλτραρίσματος πληροφορίας κειμένου, με την ο-ομασία Ping. Η ανάπτυξη του συστήματος Ping στόχευσε στη μελέτη υπαρχόντων τεχνολογικών λύσεων υπό το φως της διάχυσης πληροφορίας, και στη δημιουργία ενός πλήρως λειτουργικού συστήματος παροχής υπηρεσιών φιλτραρίσματος για τους χρήστες. Η δημιουργία ενός τέτοιου συστήματος αναδεικνύει την εφαρμοσιμότητα προηγμένων τεχνολογικών λύσεων στον τομέα της διάχυσης πληροφορίας.	el
dc.format.extent	σελ. 255	el
dc.language.iso	en	el
dc.publisher	Πανεπιστήμιο Πελοποννήσου	el
dc.rights	Αναφορά Δημιουργού-Μη Εμπορική Χρήση-Όχι Παράγωγα Έργα 3.0 Ελλάδα	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/gr/	*
dc.title	Efficient continuous multi-query processing over data streams	el
dc.title.alternative	Αποτελεσματική συνεχής επεξεργασία πολλαπλών ερωτημάτων μέσω ροών δεδομένων	el
dc.type	Διδακτορική διατριβή	el
dc.contributor.committee	Σκιαδόπουλος, Σπύρος
dc.contributor.committee	Βασιλάκης, Κώστας
dc.contributor.department	Τμήμα Πληροφορικής και Τηλεπικοινωνιών	el
dc.contributor.faculty	Σχολή Οικονομίας, Διοίκησης και Πληροφορικής	el
dc.description.abstracttranslated	In the modern digital era, the creation and availability of new information has increased exponentially. A plethora of information sources, such as news delivery sites, knowledge bases, and social networks, constantly make new content available at an overwhelming pace. To assist users in coping with the vast amount of newly generated information the Information Filtering (IF) paradigm was introduced. IF applications aim at assisting users in information discovery and enable users to cope with the information avalanche and the cognitive overload associated with it. In an IF scenario, users or services, express their information needs (implicitly or explicitly) through appropriate interfaces, tools and languages and submit profiles (or continuous queries) to a system or service. In this way, users create subscriptions that are continuously matched (by the system or service) against a stream of newly published content, and generate notifications whenever new items that match users’ information needs are published. The filtering problem is of high importance and needs to be solved efficiently, since servers are expected to handle millions of queries and high rates of incoming content. In our work, we examine the research problem of developing efficient and effective algorithms that are able to capture the nature of information streams through the form of continuous multi-query answering. To this end, we choose to explore solutions under the domains of textual information filtering, ontology publish/subscribe systems and evolving graph stream environments. Finally, we design, implement and present a fully-functional information filtering system that showcases the usefulness of the IF paradigm and provides the basis for developers to build added-value IF services in a number of different information domains. At first we examine the information filtering paradigm, under the scope of textual information filtering while employing the Boolean data model. In this setup clients subscribe to a server with continuous queries that express their information needs and get notified every time appropriate information is published. To perform this task in an efficient way, servers employ indexing schemes that support fast matches of the incoming information with the query database. However, state-of-the-art indexing schemes are sensitive to the query insertion order and cannot adapt to an evolving query workload, degrading the filtering performance over time. In this line of work, we present an adaptive trie-based algorithm that outperforms current methods by relying on query statistics to reorganize the query database. In our research, we explore query database reorganization techniques and demonstrate that the nature of the constructed tries, rather than their compactness, is the determining factor for efficient filtering performance. Our algorithm does not depend on the order of insertion of queries in the database, manages to cluster queries even when clustering possibilities are limited, and achieves two orders of magnitude filtering time improvement over its state-of-the-art competitors. Finally, we demonstrate that our solution is easily extensible to multi-core machines by providing an implementation in a multi-core environment. In the continuation of our work, we investigate publish/subscribe ontology systems; we envision a publish/subscribe ontology system that is able to index large numbers of expressive continuous queries and filter them against RDF data that arrive in a streaming fashion. To this end, we propose a SPARQL extension that supports the creation of full-text continuous queries and propose a family of main-memory query indexing algorithms which perform matching at low complexity and minimal filtering time. We experimentally compare our approach against a state-of-the-art competitor (extended to handle indexing of full-text queries) both on structural and full-text tasks using real-world data. The experimental results demonstrate that our approach yields two orders of magnitude faster performance than the competitor in all types of filtering tasks. Subsequently, in our research we study the domain of evolving graphs that have a wide range of applications involving social networks, knowledge bases and biological interactions. The evolution of a graph in such scenarios can yield important insights about the nature and activities of the underlying network, which can then be utilized for applications such as news dissemination, network monitoring, and content curation. Capturing the continuous evolution of a graph can be achieved by long-standing sub-graph queries. Although, for many applications this can only be achieved by a set of queries, state-of-the-art approaches focus on a single query scenario. Therefore, in this line of work, we introduce the notion of continuous multi-query processing over graph streams and discuss its application to a number multi-query evaluation against a stream of graph updates and experimentally demonstrated its applicability. Our results against three baseline approaches and the graph database Neo4j, using real-world and synthetic atasets, confirm a two orders of magnitude improvement of the proposed solution. Finally, we conclude our research with the design and development of a fullyfledged textual information filtering system coined Ping. The Ping IF system is a fully-functional content-based information filtering system aiming (i) to showcase the realizability of information filtering and (ii) to explore and test the suitability of the existing technological arsenal for information filtering tasks. The proposed system is entirely based upon open-source tools and components, is customizable enough to be adapted for different textual information filtering tasks, and puts emphasis in user profile expressivity, intuitive UIs, and timely information delivery. To assess the customizability of Ping, we deployed it in two distinct information filtering scenarios, while to assess its performance we designed and conducted a series of experiments for both scenarios.	el
dcterms.embargoTerms	3 years	el
dcterms.embargoLiftDate	2023-05-11T06:26:11Z

Αρχεία σε αυτό το τεκμήριο

Όνομα:: zervakis-thesis-v11.pdf
Μέγεθος:: 3.602Mb
Τύπος:: PDF

Προβολή/Άνοιγμα

Αυτό το τεκμήριο εμφανίζεται στις ακόλουθες συλλογές

Τμήμα Πληροφορικής και Τηλεπικοινωνιών (Δ. Δ.)

Εμφάνιση απλής εγγραφής

Αναφορά Δημιουργού-Μη Εμπορική Χρήση-Όχι Παράγωγα Έργα 3.0 Ελλάδα

Εκτός από όπου επισημαίνεται κάτι διαφορετικό, το τεκμήριο διανέμεται με την ακόλουθη άδεια:
Αναφορά Δημιουργού-Μη Εμπορική Χρήση-Όχι Παράγωγα Έργα 3.0 Ελλάδα