NSF III: Small: On the Conceptual Evaluation and Optimization of Queries in Spatiotemporal Data Systems
Walid G. Aref
Principal Investigator
Department of Computer Science
Purdue University
Contact Information
Walid G. Aref
Department of Computer Science
Purdue University
305 N. University Street
West Lafayette, Indiana 47907
Phone: (765) 494-1997
Fax: (765) 494-0739
Email: aref@cs.purdue.edu
URL: http://www.cs.purdue.edu/faculty/aref.html
Project Award Information
Project Web Page:
http://www.cs.purdue.edu/homes/aref/MPS
Project Focus:
The widespread use of smartphones, social networks, and location devices has given rise to a wide spectrum of location-based applications and services. Along with the advancement and ubiquity of these applications and services, users' queries and needs are steadily growing in complexity and sophistication. Current location-based query systems offer shallow query-interface capabilities that do not match users' expectations. This project identifies and addresses the challenges in correctly expressing and efficiently executing complex location-based queries. A user's query can be arbitrarily complex, combining multiple location-based, relational, and textual operations, e.g., location-based selections, spatial joins, relational selections, relational joins, and relational group-by's. Current location-based query systems support only one location-based operator, e.g., range queries or nearby queries (k-Nearest-Neighbors, or kNN, for short). Our focus is to develop a query system, termed the multi-predicate location-based query processing system, that supports any combination of location-based, relational, and textual operations.
Project Goals:
Spatial data management systems have become vital due to the ubiquity
of location-detection devices. The challenge is not only in supporting a massive number of moving
objects that generate spatiotemporal data streams, but also in supporting a
massive number of location-based continuous queries. With location-based services
becoming widespread and popular, location-based queries have gone beyond the
simple single-predicate range and k-nearest-neighbor queries. Location-based
services demand complex multi-predicate spatial (MPS, for short) queries.
Although a large body of research has been devoted to continuous spatial query processing, surprisingly none addresses the processing of MPS queries. In contrast to relational queries, MPS queries may produce different results depending on the order in which their predicates are evaluated. This makes the intended semantics of an MPS query ambiguous: different implementations of the same MPS query can return different answers, which hampers the portability and consistency of query executions across platforms. This project addresses this key limitation as well as related optimization issues.
This project has research and education goals. The research goal of the project is to develop a database query processing system, termed the multi-predicate location-based query processing system, that supports any combination of location-based and relational operations that match the needs and complexity of location-aware services. The specific research goals of the project are as follows:
1. Study MPS queries and propose a consistent and correct conceptual evaluation strategy similar to the one for relational SQL queries.
2. Based on this foundation, which addresses the correctness of MPS query execution, develop various algebraic transformation rules that retain correctness while transforming MPS query execution plans into more efficient ones.
3. Investigate the validity of well-known optimization heuristics in the context of MPS queries, with the aim of identifying several potential optimization scenarios unique to MPS queries.
4. Develop a cost model for query optimization purposes to select the best query execution plans for MPS queries.
5. Study the impact of variations in temporal and spatial object distributions on MPS query execution plans.
6. Develop new adaptive spatiotemporal query processing techniques that can cope with the dynamic nature and scale of moving-object environments.
This project has a significant educational component. The target is to address the low retention of computer science freshmen by designing more engaging entry-level programming languages and environments that go beyond the classical "Hello World" programming examples, which have shown their limits. The specific educational and training goals of this project are as follows.
1. Design a map-centric programming language and programming environment that is centered around 2D and 3D map operations (e.g., using 3D Earth and Map APIs) to increase the interest of students in programming, both in CS and in Science.
2. Train graduate and undergraduate students, resulting in graduating PhD students and in the involvement of undergraduate students in research projects.
Major activities,
highlights, and findings of the project are presented below.
Summary of
Project Activities and Highlights:
2011-2012:
In the initial phase of this project in
2011/2012, we were able to identify location-based application scenarios and
their associated queries that current state-of-the-art location-based query
systems cannot handle. We were able to isolate, identify, and classify the
reasons and ingredients for what makes these queries hard to handle. In our
VLDB 2012 paper, we identified the first such class of queries, namely the
class of 2-kNN queries, in which a query contains two
k-nearest-neighbor operators. In the VLDB paper, we not only illustrated that
many application scenarios call for queries in this class, but also identified
challenges that arise when handling this class of 2-kNN queries. These
challenges include:
1. Depending on the order of execution of the two kNN operations, the result of a 2-kNN query will differ (a minimal sketch illustrating this order dependence follows below).
2. Correctness Challenge: How to determine what the correct answer for a 2-kNN query should be, and on what basis?
3. Efficiency Challenge: How to efficiently execute a 2-kNN query while preserving correctness?
4. Query Optimization Challenge: How to adapt well-known relational query optimization strategies (e.g., pushing selects under joins in query evaluation plans) so that they remain applicable in the context of 2-kNN queries?
Our VLDB 2012 paper addressed all four challenges.
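To make the order dependence concrete, the following minimal Python sketch (the data and predicates are illustrative assumptions of ours, not from the paper) evaluates the same 2-kNN query under the two possible predicate orders and obtains two different answers:

    # Hypothetical 2-kNN query: the k restaurants that are nearest to the
    # user and also among the k nearest to a given hotel.
    from math import dist

    user, hotel = (0.0, 0.0), (10.0, 0.0)
    restaurants = [(1.0, 0.0), (2.0, 0.0), (8.0, 0.0), (9.0, 0.0)]

    def knn(focal, points, k):
        # k nearest points to the focal point, by Euclidean distance.
        return sorted(points, key=lambda p: dist(focal, p))[:k]

    k = 2
    plan_a = knn(hotel, knn(user, restaurants, k), k)  # user-kNN first
    plan_b = knn(user, knn(hotel, restaurants, k), k)  # hotel-kNN first
    print(plan_a)  # [(2.0, 0.0), (1.0, 0.0)] -- the restaurants near the user
    print(plan_b)  # [(8.0, 0.0), (9.0, 0.0)] -- the restaurants near the hotel
    # Same query, different predicate order, different answer -- hence the
    # need for well-defined conceptual evaluation semantics.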
2012-2013:
In 2012/2013, our research focused on
expanding the capabilities of multi-predicate location-based query servers
beyond the class of 2-kNN queries along several dimensions. We explored various
combinations of relational and spatial operations (Journal Papers: GeoInformatica 2013, VLDBJ 2013). We studied crowd-based
database indexing techniques (DBCrowd 2013), and
semantic web location-based linked data architectures (IEEE BigData
2013).
2013-2014:
In 2013/2014, we studied query optimization issues of queries that
involve both spatial and relational predicates. We developed a novel cost
estimation model for kNN operations when executing in
conjunction with relational operators. We extended our work to trajectories
with limited histories (SIGSPATIAL 2014). We studied adaptivity
of query execution plans in continuous streaming queries (EDBT 2014). We
extended our work to "approximate" nearest-neighbor computations in
the context of high-dimensional data, and proximity-based intersect operators
(SISAP 2014). We completed and published our work on dependency models when
human actions are part of a database transaction (TKDE 2014).
2014-2015:
In 2014/2015, we completed an important milestone in supporting
multi-predicate location-based query processing systems. We studied the various
semantics of post- and pre-filter for k-nearest-neighbor operators and their
interactions with relational operators, and the various algorithmic
optimizations that lead to orders of magnitude enhancement in performance (ACM
SIGSPATIAL 2015). We developed an accurate cost estimation model that helps
arbitrate among these optimizations and decide which one to apply at what time
(EDBT 2015a). We studied how to efficiently process Hamming-Distance-based similarity-search queries using similarity hashing techniques in both centralized and distributed setups (EDBT 2015b).
2015-2016:
In 2015/2016, we studied proximity-based group-by operations (IEEE
TKDE 2016). Our study on proximity-based set intersection operation (SISAP
2014) received the best-paper award and was invited for journal publication.
The journal version (Information Systems 2016) covers all proximity-based set operators, including the proximity-based set intersect, set difference, and set union operators. Also, we studied how to use semantics
over web data to learn how to geo-tag non-spatial concepts (e.g., crime,
pollution, or education) and show potential locations of these non-spatial
concepts over the map (MobiGIS 2015). We
developed Tornado, a prototype system for scalable
organization and query processing of Big Spatial Data. Tornado (VLDB 2015 demo)
addresses the needs of micro-blogging applications where text data is often
associated with location data, e.g., as in tweets and check-in systems. Tornado
focuses on distributed query processing of online streamed
spatio-textual data. We explored how the workload
awareness techniques that we have developed for big spatial data can be
extended to support range queries over general streaming data (WSDM 2016). We
continued to develop the LIMO map-centric programming language environment (ACM
SIGSPATIAL 2015 demo) that helps freshmen in Computer Science learn programming
concepts using maps. Through the development of the LIMO system, we realized
that programming environments centered around maps are
a potentially important topic that deserves further research. Our vision paper
(ACM SIGSPATIAL 2015 vision paper) about map-centric programming environments
highlights the research challenges and open questions in the area. We adapted
spatial indexing and spatial query processing techniques to apply in the
context of access control and preserving privacy over data streams (TKDE 2015).
Along with collaborators Drs. Shashi Shekhar and Steven Feiner, Aref
published two overview and visionary articles on the future of Spatial
Computing (GeoInformatica 2015 and Communications of
the ACM 2016). The CACM 2016 article was featured as the main cover page for
CACM. Aref, Mokbel, and Chow won the ten-year best-paper award at the VLDB 2016 conference for their 2006 VLDB paper "The New Casper: Query Processing for Location Services without Compromising Privacy" (Mohamed F. Mokbel, Chi-Yin Chow, Walid G. Aref, VLDB 2006, pp. 763-774), for inspiring research to support privacy in spatial query processing. The award was accompanied by a presentation at the 2016 VLDB conference, ten years after the original Casper paper. We investigated adaptive query processing
techniques and adaptive caching for big spatial data in in-memory distributed
spatial data systems (ICDE 2016). We developed a flexible framework for online
record linkage and fusion (ICDE 2016). Finally, we developed EDP, a graph indexing
technique to compute shortest-paths over any subset of the graph (SIGMOD 2016).
EDP allows graph updates and sub-graph selections based on categorical
attributes.
2016-2017:
In 2016/2017, my group developed two prototypes, AQWA and LocationSpark, towards scalable organization and query
processing of Big Spatial Data. AQWA focuses on dynamic and online
query-workload-aware spatial data organization that achieves up to two orders
of magnitude enhancement in performance over other static and existing big
spatial data platforms (VLDB 2015 demo and VLDB 2016 full paper). In contrast, LocationSpark is a distributed in-memory data management
system for big spatial data that uses advanced spatial filtering
techniques. We designed a spatial version of a Bloom filter and
demonstrated its use in LocationSpark (VLDB 2016
demo). We studied spatial trending queries over real-time microblogs
(SIGSPATIAL 2016). We designed a new query language, Atlas, to express all
spatial-keyword queries (SIGSPATIAL 2016). Atlas follows the same spirit as the
relational query language SQL. In Atlas, we introduce new simple building-block
constructs (similar to SQL’s selects, projects, and joins) that can be composed
together to express all types of spatial-keyword queries. We surveyed shortest-path
algorithms (Corr 2017). We developed in-memory
distributed matrix computation processing and optimization techniques for use
with large scale graph processing algorithms (ICDE
2017). Walid and one of his Ph.D. students presented a tutorial on the subject
of query processing techniques for big spatial-keyword data at the SIGMOD 2017
conference. Finally, Walid gave a keynote speech with the title: "Location+X Big Data Systems: Challenges and Some
Solutions" at the 2017 SIGMOD GeoRich Workshop.
Project Accomplishments, Significant Results, and Outcomes:
The research
and education goals of the project have all been achieved. We have published
over 30 journal and refereed conference articles. More
specifically, we published 16 journal papers in top-tier journal venues
including the IEEE Transactions on Knowledge and Data Engineering, The VLDB
Journal, Communications of the ACM, GeoInformatica,
Information Systems, the Proceedings of the VLDB Endowment (PVLDB), and the IEEE Software
magazine, as well as 21 papers in top-tier conference venues including ACM
SIGMOD, VLDB, IEEE ICDE, EDBT, ACM SIGSPATIAL, SISAP, and WSDM, as well as a tutorial at ACM SIGMOD 2017 (on the subject of Spatial-Keyword Query Processing Techniques). Moreover, our research in the scope of this project was
awarded the following best-paper awards:
-- 2016 VLDB Endowment Award: VLDB 10-year Best Paper Award for inspiring research to support privacy in spatial query processing. The award was accompanied by a presentation at the 2016 VLDB conference titled "Location Data Management: A Tale of Two Systems and the 'Next Destination'!" Mohamed F. Mokbel, Chi-Yin Chow, Walid G. Aref. PVLDB 9(13): 1622 (2016).
-- 2015 Best
Demonstration Award,
ACM SIGSPATIAL 2015, for our system, LIMO, a map-centric programming language
environment. Our LIMO system is hosted at: http://ibnkhaldun.cs.purdue.edu:8181/limo/GWTMaps.html with sample LIMO programs.
-- 2014 Best Paper Award at the International Conference on Similarity Search and Applications (SISAP) 2014 for our paper: Wadha J. Al Marri, Qutaibah M. Malluhi, Mourad Ouzzani, MingJie Tang, Walid G. Aref. The Similarity-Aware Relational Intersect Database Operator. SISAP 2014: 164-175. This paper was later extended and published in the Information Systems Journal (2016).
-- Two Ph.D.
students have graduated based on the research conducted within the scope of
this project.
-- LIMO (ACM
SIGSPATIAL 2015 Demo and Vision Papers):
We finished the
development of the LIMO map-centric programming language and programming
environment.
LIMO can be
found at: http://ibnkhaldun.cs.purdue.edu:8181/limo/GWTMaps.html
LIMO was
initially designed to serve as a map-centric programming language that uses 2D
and 3D maps as the programming "Toy" in contrast to using robots or animations.
LIMO provides high-level programming constructs that
are easy to use and that help a user navigate a 2D or 3D map. The LIMO website
contains 16 sample LIMO programs that demonstrate how to use the LIMO commands
to interactively navigate the 2D or the 3D map that is displayed on the right.
In addition, users can create their own accounts and store their own LIMO
programs in their hosted accounts on our machine. LIMO is not open for public
use as of yet. However, after a demo of LIMO at ACM SIGSPATIAL 2015 that
received best demo award, and a vision paper about map-centric programming
language at the same SIGSPATIAL conference, LIMO was very well received and
several researchers highlighted the need for using LIMO not only as an
entry-level programming language but also as a tool for spatial computing.
While the goal of LIMO as a programming language is simplicity of use, the goal
of LIMO as a spatial computing tool is to be comprehensive in functionality. The use of LIMO for spatial computing is the
subject of a future study.
-- Survey,
Overview, Tutorial, and Vision Articles on Multi-Predicate Spatial Querying and
Related Topics (SIGMOD'17, CACM'16, GeoInformatica'15, SIGSPATIAL'15, IEEE
Software Magazine'12)
One of the
important outcomes of this project is a collection of survey and overview
articles that are published in prestigious outlets including:
1. An article about Spatial Computing that was featured on the cover of the widely distributed Communications of the ACM in 2016.
2. An article titled "From GPS and virtual globes to spatial computing" that was published in the Springer-Verlag GeoInformatica Journal in 2015.
3. An article titled "On map-centric programming environments: vision paper" that was published in the Vision Papers section at ACM SIGSPATIAL 2015.
4. A tutorial titled "Query Processing Techniques for Big Spatial-Keyword Data" that was presented at ACM SIGMOD 2017.
5. An overview article about Distributed Access Control Architectures for Cloud Computing that was published in the IEEE Software Magazine in 2012.
--
Multi-predicate Spatial Query Processing and Optimization (VLDB'12, VLDBJ'13,
SIGSPATIAL'15, EDBT'15)
With this
series of papers, we have addressed all the ingredients necessary for building
a multi-predicate spatial query processing engine.
In our VLDB'12
paper, we presented the first complete study for the optimization of queries
with two k-nearest-neighbor (kNN) predicates. We
demonstrated how traditional optimization techniques can
compromise the correctness of evaluation for a query that involves two
interacting kNN predicates. For different
combinations of two kNN predicates, we presented
efficient algorithms that guarantee the correctness of evaluation, and
outperform the corresponding conceptually correct QEPs by orders of magnitude.
Our VLDB
Journal'13 paper lays the foundation for correctly evaluating multi-predicate
spatial queries involving kNN predicates. We present
a rich set of transformation rules for MPS queries to transform the conceptual
evaluation plan into more efficient plans. The experimental evaluation of the
proposed transformation rules shows they are highly effective.
In our
SIGSPATIAL'15 paper, we studied various query optimization heuristics for
queries that involve combinations of kNN select/join
predicates and relational predicates. We demonstrated how spatial locality can significantly enhance the performance of relational
joins. The presented optimizations enhance the performance of these queries
while preserving their semantics. Experimental results that are based on
queries from the TPC-H benchmark and real spatial data from OpenStreetMap
demonstrate that the presented optimizations can achieve orders of magnitude
enhancement in query performance.
In our EDBT'15
paper, we study the problem of estimating the cost of the k-NN-Select and
k-NN-Join operators, which is a needed ingredient to enable the query optimizer
to select among the proposed plans by accurately costing the introduced kNN operators. We present various estimation techniques.
Performance evaluation using real spatial datasets from OpenStreetMap
demonstrates that:
1. The Staircase technique for k-NN-Select cost estimation is faster than the state-of-the-art technique by more than two orders of magnitude and has better estimation accuracy;
2. The Catalog-Merge and Virtual-Grid techniques for k-NN-Join cost estimation achieve less than 5 and 20 percent error-ratio, respectively, while keeping the estimation time below one microsecond and one millisecond, respectively; and
3. The Virtual-Grid technique reduces the storage required to maintain the catalogs by an order of magnitude compared to the Catalog-Merge technique.
-- Adaptivity and Big Spatial Data
Techniques (EDBT'14, VLDB'15a, VLDB'16, VLDB'16b)
We studied adaptivity
for general data streams, and then in the context of big spatial data. In our
EDBT'14 paper, we developed JISC, a novel technique for plan adaptation of
continuous queries over data streams. During plan transition, JISC maintains
steady query output in contrast to existing plan adaptation techniques that can
exhibit significant output latencies. JISC employs a lazy state migration
strategy that computes the missing states in the new QEP on-demand in a single
QEP. The lazy migration strategy enables JISC to avoid performance thrashing,
which is a vital issue in all other existing techniques. A key property of JISC
is that during plan transition, it detects the unchanged parts of the QEP and
avoids recomputing their states. Our probabilistic
analysis demonstrates that the number of operators with unchanged states in the
new QEP is usually close to the total number of operators. For that reason,
JISC outperforms existing techniques that also maintain steady query output,
e.g., the Parallel Track Strategy, leading to performance gains during a plan
migration stage that can reach up to an order of magnitude.
In the context of big spatial data, in our VLDB'15 and VLDB'16
papers, we developed AQWA, a new adaptive and query-workload-aware partitioning
mechanism for large-scale spatial data. We used Hadoop
as our big data platform, for illustration. In AQWA, we addressed several
performance and system challenges; these include the limitations of Hadoop (i.e., the NameNode
bottleneck), the overhead of rebuilding the partitions in HDFS, the dynamic
nature of the data where new batches are created every day, and the issue of
workload-awareness, where the query workload is not only skewed but can also change. We showed that AQWA successfully addresses these challenges and
provides an efficient spatial-query processing framework. Using our
implementation of AQWA on a Hadoop cluster, real
spatial data from Twitter, and various workloads of range and kNN queries, we demonstrated that AQWA outperforms SpatialHadoop, the state-of-the-art system. Moreover, we
demonstrated that the overhead AQWA incurs while repartitioning the data is negligible compared to the overhead of recreating the partitions. Although our experimental evaluation is based on Hadoop, AQWA can be applied to other platforms, e.g.,
Storm, a distributed platform for processing streaming data. AQWA can
dynamically reorganize the data in the Storm's computing units (bolts)
according to the query workload and the data distribution.
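The following Python sketch conveys the flavor of this cost-based reasoning; it is a simplification of ours under assumed per-partition statistics, not AQWA's actual algorithm: split the partition whose expected query cost is highest, and only when the expected saving outweighs the repartitioning overhead.

    # Per-partition statistics: number of stored points and number of
    # workload queries hitting the partition (both assumed available).
    def pick_partition_to_split(partitions, split_cost):
        best, best_gain = None, 0.0
        for p in partitions:
            # A query over p scans work proportional to p's point count;
            # splitting p in half roughly halves what each query scans.
            saving = p["num_queries"] * p["num_points"] / 2
            gain = saving - split_cost
            if gain > best_gain:
                best, best_gain = p, gain
        return best  # None: no split currently pays for its own overhead

    parts = [{"num_points": 1_000_000, "num_queries": 50},
             {"num_points": 200_000, "num_queries": 3}]
    print(pick_partition_to_split(parts, split_cost=2_000_000))
    # -> the hot 1M-point partition (saving 25M outweighs split cost 2M)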
-- Big Spatial Data Streaming Techniques (ICDE'12, VLDB'15b,
ICDE'16, VLDB'16, SIGMOD'16)
We have published a collection of papers that demonstrate very
high throughput for spatial-keyword (e.g., micro-blogging) data streaming
systems, where we demonstrated up to two orders of magnitude enhancement over
existing systems.
Other findings
can be found in the publications below.
Research
Publications:
Spatial Queries with Two kNN Predicates [VLDB 2012]
The widespread use of
location-aware devices has led to countless location-based services in which a
user query can be arbitrarily complex, i.e., one that embeds multiple spatial selection and join predicates. Amongst these predicates, the
k-Nearest-Neighbor (kNN) predicate stands as one of
the most important and widely used predicates. Unlike related research, this
research goes beyond the optimization of queries with single kNN predicates, and shows how queries with two kNN predicates can be optimized. In particular, our
research addresses the optimization of queries with:
(i) two kNN-select predicates, (ii) two kNN-join predicates, and (iii) one kNN-join predicate and one kNN-select predicate.
For each type of
query, conceptually correct query evaluation plans (QEPs) and new algorithms
that optimize the query execution time are studied. Experimental results
demonstrate that the proposed algorithms outperform the conceptually correct
QEPs by orders of magnitude. We conducted the first complete study for the
optimization of queries with two kNN predicates. We
demonstrated how traditional optimization techniques can
compromise the correctness of evaluation for a query that involves two
interacting kNN predicates. For different
combinations of two kNN predicates, we presented
efficient algorithms that guarantee the correctness of evaluation, and
outperform the corresponding conceptually correct QEPs by orders of magnitude.
The algorithms introduced are designed for snapshot queries. Applying further
optimization techniques that can support incremental evaluation of continuous
queries with two kNN predicates is a potential future
work. Moreover, we believe that the ideas introduced through this research pave
the way towards a query optimizer that can support spatial queries with more
than two kNN predicates.
M3: Stream Processing on Main-Memory MapReduce [ICDE 2012]
MapReduce has been a popular framework for parallel computing to process large-scale data on large-scale computing clusters. However, most current implementations of the MapReduce framework support only the execution of
fixed-input jobs. Such restriction makes these implementations inapplicable for
most streaming applications, in which queries are continuous in nature and
input data streams are continuously received at high arrival rates. In this
research, we realize M3, a prototype implementation of the Hadoop
MapReduce framework in which continuous queries over
streams of data can be efficiently answered. The data path in M3 is over
main-memory only, bypassing the Hadoop Distributed
File System (HDFS) and any local disk accesses. M3 is resilient to machine
failures and utilizes the main-memory of the machines by dynamically adapting
to the fluctuations in the data arrival rates of the streams. M3 operates on
time pulses, where it has a data acquisition layer with DataTasks
that collect streamed data chunks that arrive during time periods. Map and
Reduce tasks execute continuously while periodically taking turns to process
the chunks of data collected. This research highlights techniques and
challenges for realizing various modes of synchronization and overlap in
execution of tasks in M3. We envision that M3 will be the underlying massively
distributed streaming engine for handling city-scale streamed spatiotemporal
data that we will be realizing next.
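The sketch below is our abstraction of this time-pulse model (not M3's actual implementation): each pulse delivers the chunk of stream records buffered by the acquisition layer, and the map and reduce steps fold that chunk into continuously maintained state.

    # One time pulse of a continuous main-memory map/reduce computation.
    from collections import defaultdict

    def m3_pulse(chunk, mapper, reducer, state):
        grouped = defaultdict(list)
        for record in chunk:                 # map phase for this pulse
            for key, value in mapper(record):
                grouped[key].append(value)
        for key, values in grouped.items():  # reduce phase updates state
            state[key] = reducer(key, values, state.get(key))
        return state

    # Continuous word count over a stream, one pulse per arriving chunk.
    mapper = lambda rec: [(w, 1) for w in rec.split()]
    reducer = lambda k, vs, prev: (prev or 0) + sum(vs)
    state = {}
    for chunk in [["a b a"], ["b c"]]:
        state = m3_pulse(chunk, mapper, reducer, state)
    print(state)  # {'a': 2, 'b': 2, 'c': 1}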
Similarity queries: their conceptual
evaluation, transformations, and processing [VLDB Journal 2013]
This research forms the
foundation for multi-predicate spatial queries that involve kNN
predicates. Similarity queries are one possible
generalization of range queries and kNN queries,
where the former can be viewed as a similarity operation (within epsilon
similarity) while the latter is a k-nearest-neighbors similarity. Many
application scenarios can significantly benefit from the identification and
processing of spatial similarities in the data. Even though some work has been
done to extend the semantics of some operators, for example join and selection,
to be aware of data similarities, there has not been much study on the role and
implementation of similarity-aware operations as first-class database
operators. Furthermore, very little work has addressed the problem of
evaluating and optimizing queries that combine several similarity operations.
The focus of this research is the study of similarity queries that contain one
or multiple first-class similarity database operators such as Similarity
Selection, Similarity Join, and Similarity Group-by. Particularly, we analyze
the implementation techniques of several similarity operators, introduce a
consistent and comprehensive conceptual evaluation model for similarity
queries, and present a rich set of transformation rules to extend cost-based
query optimization to the case of similarity queries.
The focus of this
research is the proposal and analysis of several similarity database operators,
and the thorough study of the evaluation and optimization of similarity queries
that combine multiple similarity operators. We introduce a model for the
conceptual evaluation of similarity queries that clearly specifies the way a
similarity query should be evaluated even if it has multiple similarity
operations. We present a rich set of transformation rules for similarity
queries to transform the conceptual evaluation plan into more efficient plans.
Furthermore, we demonstrate that transformation rules for similarity queries
can take advantage of special properties of the similarity-aware operations and
the involved distance functions to enable more useful query transformations. We
also extend the important Eager/Lazy Aggregation transformation rules to the case of Similarity Group-by (SGB) and Similarity Join (SJ). The experimental evaluation of the proposed rules shows
they are highly effective. We also show that similarity operators can be
efficiently implemented taking advantage of structures and mechanisms already
available in DBMSs.
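As a concrete illustration of one such operator, the following sketch shows an assumed, simplified group-around-center semantics for a similarity group-by over one-dimensional values (the paper defines richer variants and their query optimization):

    # Each value joins the group of its closest center, provided it lies
    # within eps of that center; unmatched values fall into no group.
    def similarity_group_by(values, centers, eps):
        groups = {c: [] for c in centers}
        for v in values:
            c = min(centers, key=lambda c: abs(v - c))
            if abs(v - c) <= eps:
                groups[c].append(v)
        return groups

    print(similarity_group_by([1.0, 1.2, 4.9, 5.1, 9.0],
                              centers=[1, 5], eps=0.5))
    # {1: [1.0, 1.2], 5: [4.9, 5.1]}  -- 9.0 matches no group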
Continuous aggregate nearest neighbor
queries [GeoInformatica 2013]
This research addresses
the problem of supporting continuous aggregate nearest-neighbor (CANN) queries
for moving objects in spatio-temporal data stream
management systems. A CANN query specifies a set of landmarks, an integer k,
and an aggregate distance function f (e.g., min, max, or sum), where f computes
the aggregate distance between a moving object and each of the landmarks. The
answer to this continuous query is the set of k moving objects that have the
smallest aggregate distance f. A CANN query may also be viewed as a combined
set of nearest neighbor queries. We introduce several algorithms to
continuously and incrementally answer CANN queries. Extensive experimentation
shows that the proposed operators outperform the state-of-the-art algorithms by
up to a factor of 3 and incur low memory overhead. P-CANN is a best-effort
progressive algorithm that associates thresholds with the individual landmarks.
These thresholds are used to eagerly prune the moving objects. Different
elimination order policies are identified to specify the order of the landmarks
in the computation of the aggregate distance in P-CANN. The optimization of
P-CANN and how to assign the thresholds are addressed. In our extensive experiments, performance improves by up to a factor of 3 over the state-of-the-art algorithm. P-CANN outperforms both
H-CANN and the CPM algorithm (the state of the art). For the optimization of
P-CANNs, the worst-fit elimination policy gives the
least penalty for sum (additional 14% CPU cost away from optimal) when we do
not use the prohibitively expensive optimal order. On the other hand, the
best-fit elimination policy gives the least penalty for the max case. The
optimization time of P-CANN is about 100 msec for
typical CANN queries. P-CANN, being a best-effort algorithm, may produce about 95% of the required output size on average. As for future
work, we will investigate the continuous aggregate nearest neighbor queries on
moving objects in road networks. On the one hand, we would like to develop
similar incremental algorithms using the road network distance instead of the
Euclidean distance between the objects. On the other hand, we would like to
make CANN a first-class operator in a data stream management system so that the CANN operator is considered while optimizing the query evaluation plan.
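For reference, the following Python sketch computes a CANN answer by straightforward recomputation (the naive baseline; the paper's contribution is answering this continuously and incrementally):

    # The k objects with the smallest aggregate distance f to the landmarks.
    from math import dist

    def cann(objects, landmarks, k, f=sum):
        # f aggregates the per-landmark distances, e.g., sum, min, or max.
        agg = lambda o: f(dist(o, l) for l in landmarks)
        return sorted(objects, key=agg)[:k]

    landmarks = [(0, 0), (4, 0)]
    objects = [(2, 0), (2, 3), (10, 10)]
    print(cann(objects, landmarks, k=2))          # sum aggregate
    print(cann(objects, landmarks, k=2, f=max))   # max aggregate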
The Palm-tree Index: Indexing with the
crowd [DBCrowd 2013]
This research addresses
using humans (the crowd) to perform database indexing and searching operations.
Crowdsourcing services allow employing human intelligence in tasks that are
difficult to accomplish with computers such as image tagging and data
collection. At a relatively low monetary cost and through web interfaces such
as Amazon’s Mechanical Turk (AMT), humans can act as a computational operator
in large systems. Recent work has been conducted to build database management
systems that can harness the crowd power in database operators, such as sort,
join, count, etc. The fundamental problem of indexing within crowd-sourced
databases has not been studied. In this research, we study the problem of
tree-based indexing within crowd-enabled databases. We investigate the effect
of various tree and crowdsourcing parameters on the quality of index
operations. We propose new algorithms for index search, insert, and update. We
propose a taxonomy of the problem and highlight its
main challenges. We propose techniques for index construction and querying. We
intend to complete this study with an extensive analysis and experimental
evaluation. Our current work focuses on one-dimensional indexing. We plan to
study other variations of crowd-sourced indexing, such as multidimensional
indexing, spatial and spatio-temporal indexing and
fuzzy indexing.
Knowledge Cubes [IEEE BigData 2013]
A Knowledge Cube, or
cube for short, is an intelligent and adaptive database instance capable of
storing, analyzing, and searching data. Each cube is established based on a set
of semantic aspects, e.g., (1) Topical, (2) Contextual, (3) Spatial, or (4)
Temporal. A cube specializes in handling data that is only relevant to its
semantic aspect. Knowledge cubes are inspired by two prime architectures: (1)
"Dataspaces" that provides an abstraction
for data management where heterogeneous data sources can co-exist and it
requires no unifying schema to be specified in advance (2) "Linked
Data" that provides a set of best practices for publishing and
interlinking structured data on the web. A knowledge cube uses the best
practices of Linked Data as its main building block for its data layer and
encompasses some of the data integration abstractions defined by Dataspaces. In this research, we propose knowledge cubes as
a semantically-guided data management architecture,
where data management is influenced by the data semantics rather than by a
predefined scheme. Knowledge cubes support the five pillars of Big Data also
known as the five V's: Volume, Velocity, Veracity, Variety, and Value. Many
interesting opportunities can be leveraged when learning the semantics of the
data. In the research, we highlight these opportunities and propose a strawman design for knowledge cubes along with the research
challenges that arise when realizing them. Our research forms a vision for the
knowledge cube architecture that motivates understanding the semantics of the
data as a means of solving some of the research challenges we face today,
specifically in Big Data. We also
motivate the use of Linked Data as a means of achieving this target given the
rich semantics that Linked Data provides. We illustrate some of the challenges
and potential solutions associated with the knowledge cubes proposal. Finally,
we demonstrate how the proposed architectural components can be used to tackle
the five pillars of Big Data.
JISC: Adaptive Stream Processing Using
Just-In-Time State Completion [EDBT 2014]
The
continuous and dynamic nature of data streams may lead a query execution plan
(QEP) of a long-running continuous query to become suboptimal during execution,
and hence will need to be altered. The ability to perform an efficient and
flawless transition to an equivalent, yet optimal QEP is essential for a data
stream query processor. Such transition is challenging for plans with stateful binary operators, such as joins, where the states
of the QEP have to be maintained during query transition without compromising
the correctness of the query output. In this research, we develop a
Just-In-Time State Completion (JISC); a new technique
for query plan migration. JISC does not cause any halt to the query execution,
and thus allows the query to maintain steady output. JISC is applicable to
pipelined as well as eddy-based query evaluation frameworks. During plan
transition, JISC maintains steady query output in contrast to existing plan
adaptation techniques that can exhibit significant output latencies. Maintaining
steady query output is essential for applications that require continuous
monitoring or real-time response. JISC employs a lazy state migration strategy
that computes the missing states in the new QEP on-demand in a single QEP. In
highly dynamic scenarios where the queries and streams have fluctuating selectivities and arrival rates, overlapped plan
transitions are imminent. The lazy migration strategy enables JISC to avoid
performance thrashing, which is a vital issue in all other existing techniques.
A key property of JISC is that during plan transition, it detects the unchanged
parts of the QEP and avoids recomputing their states.
Our probabilistic analysis demonstrates that the number of operators with
unchanged states in the new QEP is usually close to the total number of
operators. For that reason, JISC outperforms existing techniques that also
maintain steady query output, e.g., the Parallel Track Strategy, leading to
performance gains during a plan migration stage that can reach up to an order
of magnitude.
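The following minimal sketch captures this idea in our own abstraction (JISC's actual machinery operates on stateful stream operators inside a query processor): states of operators common to the old and new plans are carried over, and only the missing states are left for just-in-time, on-demand completion.

    # Plans map operator ids to their state (e.g., join hash tables).
    def migrate(old_plan, new_plan):
        new_states = {}
        for op in new_plan:
            if op in old_plan:
                new_states[op] = old_plan[op]   # reuse unchanged state
            else:
                new_states[op] = None           # complete just-in-time
        return new_states

    old_plan = {"join(A,B)": {"a1": "b1"}, "join(AB,C)": {"ab1": "c1"}}
    new_plan = ["join(A,B)", "join(AB,D)"]
    print(migrate(old_plan, new_plan))
    # join(A,B) keeps its state; join(AB,D)'s state is built lazily on demand.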
Indexing Recent Trajectories of Moving
Objects [ACM SIGSPATIAL 2014]
Advances
in location-aware and object-tracking devices have led to the development of
location-based services that process large amounts of spatio-temporal
data. The need for efficient processing of queries on the locations of moving
objects calls for trajectory-based indexing methods. These indexes need to
capture both the spatial and temporal dimensions of the data in a way that
minimizes the number of disk I/Os required for both
updating and querying. Most existing spatio-temporal
index structures capture either the current locations of the moving objects or
the entire history of the moving objects. However, many applications require
that only the recent history of a moving object's trajectory be stored. This
research introduces the trails-tree, a disk-based data
structure for indexing the recent history of moving objects' trajectories. The
trails-tree maintains a temporal-sliding window over trajectories and uses: (1)
an in-memory memo structure that reduces the I/O cost of updates using a
lazy-update mechanism, and (2) lazy vacuum-cleaning mechanisms to delete parts
of the trajectories that fall out of the sliding window. Theoretical analysis
and experimental evaluation illustrate that the trails-tree outperforms the
state-of-the-art index structures for indexing recent trajectory data by up to
a factor of two.
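The sketch below conveys the sliding-window and lazy-vacuuming ideas in a deliberately simplified, in-memory form (the trails-tree itself is a disk-based index with an update memo):

    # Sliding-window trajectory store with lazy vacuuming: expired segments
    # are discarded only when a trajectory is touched, not on every tick.
    class RecentTrajectories:
        def __init__(self, window):
            self.window, self.trails = window, {}

        def update(self, obj, t, pos):
            trail = self.trails.setdefault(obj, [])
            trail.append((t, pos))
            # Lazy vacuuming: prune expired points for the touched object.
            cutoff = t - self.window
            self.trails[obj] = [(ts, p) for ts, p in trail if ts >= cutoff]

        def recent(self, obj):
            return self.trails.get(obj, [])

    store = RecentTrajectories(window=10)
    store.update("cab7", 1, (0, 0))
    store.update("cab7", 12, (5, 5))   # the point at t=1 leaves the window
    print(store.recent("cab7"))        # [(12, (5, 5))]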
The
Similarity-aware Relational Intersect Database Operator [SISAP 2014] (Best
Paper Award)
Identifying
similarities in large datasets is an essential operation in many applications such
as bioinformatics, pattern recognition, and data integration. To make the
underlying database system similarity-aware, the core relational operators have
to be extended. Several similarity-aware relational operators have been
proposed that introduce similarity processing at the database engine level,
e.g., similarity joins and similarity group-by. In this research, we extend the
semantics of the set intersection operator to operate over similar values.
We
introduced the semantics and extended SQL syntax of the similarity-based set
intersection operator. We developed an algorithm that is based on a
Mark/Restore mechanism to avoid the O(n^2) complexity.
The proposed operator is implemented inside an open-source database system,
namely PostgreSQL. Several queries from the TPC-H
benchmark are extended to include similarity-based set intersection predicates.
Our implementation of the proposed operator outperforms the queries that
produce the same result using only regular operations. The speedup ranges from 1000 times down to 4 times as the similarity threshold grows from 0.01% to 10% of the attribute domain range. We also demonstrated that the added functionality is achieved without significant overhead when compared to standard
operators.
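The following simplified sketch shows how a mark/restore-style sorted merge keeps a similarity intersect near linear time instead of O(n^2); it is our illustration of the idea, not the PostgreSQL implementation:

    # Both inputs are sorted; the right-side cursor is "marked" at the first
    # value that can still match the current left value, and "restored" from
    # there for the next left value instead of rescanning from the start.
    def similarity_intersect(left, right, eps):
        left, right = sorted(left), sorted(right)
        out, mark = [], 0
        for x in left:
            i = mark  # restore: resume from the previous mark
            # Skip right values too small to ever match again.
            while i < len(right) and right[i] < x - eps:
                i += 1
            mark = i  # mark: the next (larger) left value restarts here
            if i < len(right) and right[i] <= x + eps:
                out.append(x)  # some right value lies within eps of x
        return out

    print(similarity_intersect([1.0, 2.0, 7.0], [1.9, 6.8], eps=0.5))
    # [2.0, 7.0] -- the mark only moves forward, so the merge stays linear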
HandsOn DB: Managing Data Dependencies Involving Human
Actions [TKDE 2014]
Consider
two values, x and y, in the database, where y = F(x).
To maintain the consistency of the data, whenever x changes, F needs to be
executed to re-compute y and update its value in the database. This is
straightforward in the case where F can be executed by the DBMS, e.g., SQL or C
function. In this research (TKDE 2014), we address the more
challenging case where F is a human action, e.g., conducting a wet-lab
experiment, taking manual measurements, or collecting instrument readings. In
this case, when x changes, y remains invalid (inconsistent with the current
value of x) until the human action involved in the derivation is performed and
its output result is reflected into the database. Many application domains,
e.g., scientific applications in biology, chemistry, and physics, contain
multiple such derivations and dependencies that involve human actions. In this
research, we introduce HandsOn DB, a prototype
database engine for managing dependencies that involve human actions while
maintaining the consistency of the derived data. HandsOn
DB includes the following features: (1) semantics and syntax for interfaces
through which users can register human activities into the database and express
the dependencies among the data items on these activities; (2) mechanisms for
invalidating and revalidating the derived data; and (3) new operator semantics
that alert users when the returned query results contain potentially invalid
data, and enable evaluating queries on either valid data only, or both valid
and potentially invalid data. Performance results are presented that study the
overheads associated with these features and demonstrate the feasibility and
practicality in realizing HandsOn DB.
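A minimal sketch of the invalidate/revalidate life cycle, in our own simplified form (HandsOn DB realizes this inside the database engine, with query-time alerts):

    # y = F(x) where F is a human action: when x changes, y is flagged
    # invalid until the action's result is recorded back into the database.
    class HumanDerivedValue:
        def __init__(self, value):
            self.value, self.valid = value, True

        def source_changed(self):
            self.valid = False    # invalidate: x changed, F not yet redone

        def record_action_result(self, value):
            self.value, self.valid = value, True   # revalidate

    y = HumanDerivedValue(3.7)   # e.g., a wet-lab measurement of sample x
    y.source_changed()           # sample x was re-prepared
    print(y.value, y.valid)      # 3.7 False -> queries flag it as invalid
    y.record_action_result(4.1)  # the human re-ran the experiment
    print(y.value, y.valid)      # 4.1 True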
Learning how to write
computer programs is often challenging for the novice with little or no coding
experience. Many pedagogical programming environments motivate programming with
intuitive syntax-free interfaces, interesting real-world contexts, and animated
output. In this research, we introduce LIMO, a web-based programming environment that is centered around operations on interactive geographical maps,
location-oriented data, and the operations of synthetic objects that move on
the maps. LIMO materializes a low-cost open-ended environment that integrates
interactive maps and spatial data (e.g., OpenStreetMap).
The unique advantage of LIMO is that it relates programming concepts to
geographical maps and locations that have basis in reality. LIMO offers an
environment for learning how to program by providing 1) map and programming
constructs, 2) high-quality interactive map graphics, and 3) example programs
that introduce users to writing programs in the LIMO environment.
AQWA: Adaptive Query-Workload-Aware Partitioning of
Big Spatial Data [VLDB 2016 – Full Paper]
In support of big
spatial data, this research focuses on the data layer in a cluster-based
platform. The motivation for this research is the unprecedented spread of
location-aware devices that has resulted in a plethora of location-based
services in which huge amounts of spatial data need to be efficiently processed
by large-scale computing clusters. Existing cluster-based systems for
processing spatial data employ static data partitioning structures that cannot
adapt to data changes, and that are insensitive to the query-workload. Hence,
these systems are not able to consistently provide good performance. To close
this gap, we present AQWA, an adaptive and query-workload-aware data
partitioning mechanism for processing large-scale spatial data. AQWA does not
assume prior knowledge of the data distribution or the query-workload. Instead,
AQWA incrementally reacts to changes in both the data and the query-workload.
As data is consumed and queries are processed, the data partitions are
incrementally updated. With extensive experiments using real spatial data from
Twitter, we demonstrate that AQWA can achieve an order of magnitude gain in
query performance compared to standard spatial partitioning structures.
Spatial Queries with k-Nearest-Neighbor and
Relational Predicates [ACM SIGSPATIAL 2015 – Full Paper]
This research is an
important milestone for this project as it shows all the feasible optimizations
when k-Nearest-Neighbor (kNN, for short) predicates
are intermixed with relational predicates in SQL-like query languages. This
research is motivated by the ubiquity of location-aware devices and smartphones
that has unleashed an unprecedented proliferation of location-based services
that require processing queries with both spatial and relational predicates.
Many algorithms and index structures already exist for processing kNN predicates either solely or when combined with textual
keyword search. Unfortunately, there has not been enough study on how to
efficiently process queries where kNN predicates are
combined with general relational predicates, i.e., ones that have selects,
joins and group-by’s. One major challenge is that because the kNN is a ranking operation, applying a relational predicate
before or after a kNN predicate in a query evaluation
pipeline (QEP, for short) can result in different outputs, and hence leads to
different query semantics. In particular, this renders classical relational
query optimization heuristics, e.g., pushing selects below joins, inapplicable.
This research introduces and studies various query optimization heuristics for
queries that involve combinations of kNN select/join
predicates and relational predicates. The proposed optimizations can
significantly enhance the performance of these queries while preserving their
semantics. Experimental results that are based on queries from the TPC-H
benchmark and real spatial data from OpenStreetMap
demonstrate that the proposed optimizations can achieve orders of magnitude
enhancement in query performance.
Cost Estimation of Spatial k-Nearest-Neighbor
Operators [EDBT 2015]
Advances
in geo-sensing technology have led to an unprecedented spread of location-aware
devices. In turn, this has resulted in a plethora of location-based services
in which huge amounts of spatial data need to be efficiently consumed by
spatial query processors. For a spatial query processor to properly choose
among the various query processing strategies, the
cost of the spatial operators has to be estimated. In this research, we study the
problem of estimating the cost of the spatial k-nearest-neighbor (k-NN, for
short) operators, namely, k-NN-Select and k-NN-Join. Given a query that has a
k-NN operator, the objective is to estimate the number of blocks that are going
to be scanned during the processing of this operator. Estimating the cost of a
k-NN operator is challenging for several reasons. For instance, the cost of a k-NN-Select operator is directly affected by the value of
k, the location of the query focal point, and the distribution of the data.
Hence, a cost model that captures these factors is relatively hard to realize.
This study introduces cost estimation techniques that maintain a compact set of
catalog information that can be kept in main-memory to enable fast estimation
via lookups. A detailed study of the performance and accuracy trade-off of each
proposed technique is presented. Experimental results using real spatial
datasets from OpenStreetMap demonstrate the
robustness of the proposed estimation techniques.
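To convey the flavor of catalog-based estimation, the sketch below uses a simple grid catalog of per-cell point counts; this is a stand-in of our own, not the paper's Staircase, Catalog-Merge, or Virtual-Grid techniques:

    # Estimate the blocks a k-NN-Select scans by growing a square region of
    # grid cells around the query's focal cell until it covers k points.
    def estimate_knn_blocks(catalog, cell, k, points_per_block):
        # catalog maps (row, col) grid cells to point counts.
        r0, c0 = cell
        covered = 0
        for radius in range(64):  # bounded growth for this sketch
            covered = sum(cnt for (r, c), cnt in catalog.items()
                          if abs(r - r0) <= radius and abs(c - c0) <= radius)
            if covered >= k:
                break
        # Blocks scanned ~= points in the covering region / block capacity.
        return max(1, covered // points_per_block)

    catalog = {(0, 0): 10, (0, 1): 50, (1, 0): 5, (1, 1): 200}
    print(estimate_knn_blocks(catalog, cell=(0, 0), k=100,
                              points_per_block=64))
    # -> 4 (the region covers 265 points, i.e., about four 64-point blocks)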
A Demonstration of AQWA: Adaptive Query Workload
Aware Partitioning of Big Spatial Data
[VLDB 2015 Demo Paper]
During Academic Year
2014-2015, we have realized a prototype open-source system for AQWA. AQWA is
motivated by the need to efficiently organize Big Spatial Data. The
ubiquity of location-aware devices, e.g., smartphones and GPS-devices, has led to
a variety of location-based services in which huge amounts of geo-tagged
information is created every day. This prototype system demonstrates AQWA, a
novel mechanism for partitioning large-scale spatial data over distributed
clusters. Unlike existing cluster-based systems, e.g. SpatialHadoop,
that apply static partitioning of spatial data, AQWA has the ability to react
to changes in query-workload or data distribution. A key feature of AQWA is
that it does not assume prior knowledge of the query-workload or data
distribution. Instead, AQWA adapts to changes in the query-workload or data
distribution in an online fashion by incrementally updating the partitioning of
the data in a way that enhances the query execution time. We demonstrate two
real prototypes of AQWA applied to both Hadoop and
Spark platforms. In both prototypes, we process spatial range and
k-nearest-neighbor queries over large-scale real and synthetic datasets, and
explore the performance of AQWA under different query-workloads. Currently, we
are making the code for AQWA on Hadoop publicly
available.
Tornado: A Distributed SpatioTextual
Stream Processing System [VLDB 2015 Demo Paper]
During Academic Year
2014-2015, we have realized a prototype open-source system termed Tornado.
Tornado is motivated by the need to efficiently query Big Spatial Data.
Nowadays, most spatial data is also associated with text data. The widespread
use of location-aware devices together with the increased popularity of
micro-blogging applications (e.g., Twitter) has led to the creation of large
streams of spatio-textual data. In order to serve
real-time applications, the processing of these large-scale spatio-textual
streams needs to be distributed. However, existing distributed stream
processing systems (e.g., Spark and Storm) are not optimized for
spatial/textual content. In this demonstration, we introduce Tornado, a
distributed in-memory spatio-textual stream processing server that extends Storm. To efficiently
process spatio-textual streams, Tornado introduces a spatio-textual indexing layer to the architecture of Storm.
The indexing layer is adaptive, i.e., dynamically re-distributes the processing
across the system according to changes in the data distribution and/or query
workload. In addition to keywords, higher-level textual concepts are identified
and are semantically matched against spatio-textual
queries. Tornado provides data de-duplication and fusion to eliminate redundant
textual data. We demonstrate a prototype of Tornado running against real
Twitter streams, where the users can register continuous or snapshot spatio-textual queries using a map-assisted query
interface.
Kangaroo: Workload-Aware Processing of Range
Queries in Hadoop [WSDM 2016 – Full Paper]
Despite the importance
and widespread use of range data, e.g., time intervals, spatial ranges, etc.,
little attention has been devoted to study the processing and querying of range
data in the context of big data. The main challenge lies in the nature of traditional index structures, e.g., the B-Tree and R-Tree, which are centralized by design and hence perform poorly when deployed in a distributed environment. To address this challenge, in this study, we prototype Kangaroo, a
system built on top of Hadoop to optimize the
execution of range queries over range data. The main idea behind Kangaroo is to
split the data into non-overlapping partitions in a way that minimizes the
query execution time. Kangaroo is query-workload-aware, i.e., it produces
partitioning layouts that minimize the query processing time of given query
patterns. In this study, we expose and address the design challenges Kangaroo
addresses in order to be deployed on top of a distributed file system, i.e.,
HDFS. We also study four different partitioning schemes that Kangaroo can
support. With extensive experiments using real range data of more than one
billion records and real query workload of more than 30,000 queries, we show
that the partitioning schemes of Kangaroo can significantly reduce the I/O of
range queries on range data.
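As a toy illustration of workload-aware partitioning (our own simplification, not one of Kangaroo's four schemes), the sketch below places partition boundaries at quantiles of the workload's query endpoints so that boundaries concentrate where queries do:

    # queries: list of (lo, hi) ranges from the observed workload.
    def workload_aware_boundaries(queries, num_partitions):
        endpoints = sorted(p for q in queries for p in q)
        # Boundaries at evenly spaced quantiles of the query endpoints keep
        # frequent query ranges from straddling many partitions.
        step = len(endpoints) / num_partitions
        return [endpoints[int(i * step)] for i in range(1, num_partitions)]

    workload = [(10, 20), (12, 18), (15, 30), (100, 110)]
    print(workload_aware_boundaries(workload, num_partitions=3))
    # [15, 30] -- boundaries cluster around the hot [10, 30] region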
Graph Indexing for Shortest-Path Finding over
Dynamic Sub-Graphs [SIGMOD 2016 – Full Paper]
A variety of
applications spanning various domains, e.g., social networks, transportation,
and bioinformatics, have graphs as first-class citizens. These applications
share a vital operation, namely, finding the shortest path between two nodes.
In many scenarios, users are interested in filtering the graph before finding
the shortest path. For example, in social networks, one may need to compute the
shortest path between two persons on a sub-graph containing only family
relationships. This research focuses on dynamic graphs with labeled edges,
where the target is to find a shortest path after filtering some edges based on
user-specified query labels. This problem is termed the Edge-Constrained
Shortest Path query (or ECSP, for short). This research introduces
Edge-Disjoint Partitioning (EDP, for short), a new technique for efficiently answering
ECSP queries over dynamic graphs. EDP has two main components: a dynamic index
that is based on graph partitioning, and a traversal
algorithm that exploits the regular patterns of the answers of ECSP queries.
EDP partitions the graph based on the labels of the edges. On demand, EDP
computes specific sub-paths within each partition and updates its index. The
computed sub-paths are cached and can be leveraged by future queries. To answer
an ECSP query, EDP connects sub-paths from different partitions using its
efficient traversal algorithm. EDP can dynamically handle various types of
graph updates, e.g., label, edge, and node updates. The index entries that are
potentially affected by graph updates are invalidated and are re-computed on
demand. EDP is evaluated using real graph datasets from various domains.
Experimental results demonstrate that EDP can achieve query performance gains
of up to four orders of magnitude in comparison to state of the art techniques.
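For context, the naive ECSP evaluation that EDP is compared against can be sketched as a label-filtered Dijkstra search, shown below (this is the baseline, not the EDP index):

    import heapq

    def ecsp_baseline(graph, src, dst, allowed_labels):
        # graph: {node: [(neighbor, weight, label), ...]}
        pq, seen = [(0, src)], set()
        while pq:
            d, u = heapq.heappop(pq)
            if u == dst:
                return d
            if u in seen:
                continue
            seen.add(u)
            for v, w, label in graph.get(u, []):
                # Filter edges by the query's label set before relaxing.
                if label in allowed_labels and v not in seen:
                    heapq.heappush(pq, (d + w, v))
        return None  # no path exists under the label constraint

    graph = {"a": [("b", 1, "family"), ("c", 1, "work")],
             "b": [("d", 1, "family")],
             "c": [("d", 1, "work")]}
    print(ecsp_baseline(graph, "a", "d", {"family"}))  # 2 (a -> b -> d)
    print(ecsp_baseline(graph, "a", "d", {"work"}))    # 2 (a -> c -> d)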
Cruncher: Distributed In-Memory Processing for
Location-Based Services [IEEE ICDE 2016 Demo Paper]
Advances in
location-based services (LBS) demand high-throughput processing of both static
and streaming data. Recently, many systems have been introduced to support
distributed main-memory processing to maximize the query throughput. However,
these systems are not optimized for spatial data processing. In this demonstration, we
showcase Cruncher, a distributed main-memory spatial data warehouse and
streaming system. Cruncher extends Spark with adaptive query processing
techniques for spatial data. Cruncher uses dynamic batch processing to
distribute the queries and the data streams over commodity hardware according
to an adaptive partitioning scheme. The batching technique also groups and
orders the overlapping spatial queries to enable inter-query optimization. Both
the data streams and the offline data share the same partitioning strategy that
allows for data co-locality optimization. Furthermore, Cruncher uses an
adaptive caching strategy to maintain the frequently-used
location data in main memory. Cruncher maintains operational statistics to
optimize query processing, data partitioning, and caching at runtime. We demonstrate
two LBS applications over Cruncher using real datasets from OpenStreetMap
and two synthetic data streams. We demonstrate that Cruncher achieves order(s)
of magnitude throughput improvement over Spark when processing spatial data.
Efficient Processing of Hamming-Distance-Based
Similarity Search Queries Over MapReduce [EDBT 2015]
Similarity and proximity
searches are crucial to many applications. Of particular interest are two
flavors of the Hamming distance range query, namely, the Hamming select and the
Hamming join (Hamming-select and Hamming-join, respectively). Hamming distance
is widely used in approximate near neighbor search for high dimensional data, such
as images and document collections. For example, using predefined similarity
hash functions, high-dimensional data is mapped into one-dimensional binary codes that are then linearly scanned to perform Hamming-distance comparisons. These distance comparisons on the binary codes are usually costly and often involve excessive redundancies. In this study, we
introduce a new index, termed the HA-Index, that
speeds up distance comparisons and eliminates redundancies when performing the
two flavors of Hamming distance range queries. An efficient search algorithm
based on the HA-index is presented. A distributed version of the HA-index is
introduced and algorithms for realizing Hamming distance-select and Hamming
distance-join operations on a MapReduce platform are
prototyped. Extensive experiments using real datasets demonstrate that the
HA-index and the corresponding search algorithms achieve up to two orders of
magnitude speedup over existing state-of-the-art approaches, while saving more
than ten times in memory space.
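To make the two query flavors concrete, here is a baseline Python sketch over integer-encoded binary codes. This is the linear scan that the HA-index is designed to beat, not the HA-index itself; it relies on int.bit_count() (the popcount), available from Python 3.10.

    def hamming_select(codes, query, r):
        """All codes within Hamming distance r of the query code."""
        return [c for c in codes if (c ^ query).bit_count() <= r]

    def hamming_join(left, right, r):
        """Naive all-pairs Hamming-join within distance r."""
        return [(a, b) for a in left for b in right
                if (a ^ b).bit_count() <= r]

    codes = [0b10110, 0b10111, 0b00000]
    print(hamming_select(codes, 0b10110, 1))         # [22, 23]
    print(hamming_join([0b000], [0b001, 0b111], 1))  # [(0, 1)]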
Geo-tagging Non-Spatial Concepts [ACM SIGSPATIAL
Mobile GIS workshop 2015] (paper)
This research addresses
an interesting new type of query, namely, geo-tagging non-spatial concepts.
For example, consider the following queries: Show on the map potential crime
locations in Chicago, or show on the map potential pollution regions in San
Francisco. Because “crime” and “pollution” are non-spatial concepts, it is hard
to pinpoint a location on the map at which to place these
concepts. In general, Concept
Geo-tagging is the process of assigning a textual identifier that describes a
real-world entity to a physical geographic location. A concept can either be a
spatial concept that possesses a spatial presence or a non-spatial concept
that has no explicit spatial presence. Geo-tagging locations with
non-spatial concepts to which they have no direct relation is a very useful and
important operation but is also very challenging. The reason is that a
non-spatial concept, e.g., crime, has no explicit spatial presence, making it hard to
geo-tag. In this study, we propose to use the semantic information
associated with concepts and locations, such as their types, as a means of
identifying these relations. The co-occurrence of spatial and non-spatial
concepts within the same textual resources, e.g., in the web, can be an
indicator of a relationship between these spatial and non-spatial concepts.
Techniques are presented for learning and modeling relations among spatial and
non-spatial concepts from web textual resources. Co-occurring concepts are
extracted and modeled as a graph of relations. This graph is used to infer the
location types related to a concept. A location type can be a hospital, a restaurant,
an educational facility, and so forth. Due to the immense number of relations
that are generated from the extraction process, a semantically-guided
query processing algorithm is introduced to prune the graph to the most
relevant set of related concepts. For each concept, the most relevant types are
matched against the location types. Experiments evaluate the proposed algorithm
based on its filtering efficiency and the relevance of the discovered
relationships. Performance results illustrate how semantically-guided
query processing can outperform the baseline in terms of efficiency and
relevancy. The proposed approach achieves an average precision of 74% across
three different datasets.
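A minimal sketch of the co-occurrence idea follows. The toy documents, the LOCATION_TYPES vocabulary, and the scoring are all illustrative assumptions; the paper's graph model and semantically-guided pruning are far richer.

    from collections import Counter

    docs = [
        "crime reported near the downtown bar last night",
        "bar fight and theft, a crime wave near nightclubs",
        "new hospital wing opened downtown",
    ]
    LOCATION_TYPES = {"bar", "nightclub", "nightclubs", "hospital"}

    def related_types(concept):
        """Rank location types by how often they co-occur with the concept."""
        scores = Counter()
        for doc in docs:
            words = set(doc.split())
            if concept in words:
                scores.update(words & LOCATION_TYPES)
        return scores.most_common()

    print(related_types("crime"))   # [('bar', 2), ('nightclubs', 1)]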
LIMO: Learning Programming using Interactive Map
Activities [ACM SIGSPATIAL 2015 Demo Paper] (Demo paper)
Advances in geographic
information, interactive two- and three-dimensional map visualization
accompanied with the proliferation of mobile devices and location data have
tremendously benefited the development of geo-educational applications. We
demonstrate LIMO, a web-based programming environment that is centered around
operations on interactive geographical maps, location-oriented data, and the
operations of synthetic objects that move on the maps. LIMO materializes a
low-cost open-ended environment that integrates interactive maps and spatial
data (e.g., OpenStreetMap). The unique advantage of
LIMO is that it relates programming concepts to interactive geographical maps
and location data. LIMO offers an environment for students to learn how to
program by providing (1) an easy-to-program library of
map and spatial operations, (2) high-quality
interactive map graphics, and (3) example programs that
introduce users to writing programs in the LIMO environment.
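For flavor, here is a toy, runnable Python stand-in for the kind of map-and-moving-object primitives LIMO exposes. Every name is hypothetical; LIMO itself is a web-based environment, and this is not its API.

    class MovingObject:
        def __init__(self, icon, position):
            self.icon, self.position = icon, position
        def move_north(self, meters):
            lat, lon = self.position
            # ~111,320 meters per degree of latitude
            self.position = (lat + meters / 111_320.0, lon)

    car = MovingObject(icon="car", position=(40.4237, -86.9212))  # Purdue campus
    for _ in range(10):          # a loop, taught by moving an object on a map
        car.move_north(meters=50)
    print(car.position)          # the car ends about 500 meters north of its start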
On Map-Centric Programming Environments [ACM
SIGSPATIAL 2015 Vision Paper] (Vision paper)
2D maps and 3D globes
can be used as programming toys to help students learn programming, in contrast
to using robots or visual and multimedia components in Computer Science I
introductory programming courses. This study exposes research challenges related
to supporting this concept, and presents one instance of a 2D and 3D map-based
programming environment to motivate these challenges.
Similarity Group-by Operators for Multi-dimensional
Relational Data [IEEE TKDE 2016] (paper)
The SQL group-by operator
plays an important role in summarizing and aggregating large datasets in a data
analytics stack. While the standard group-by operator, which is based on equality,
is useful in several applications, allowing proximity-aware grouping provides a
more realistic view on real-world data that could lead to better insights. The
Similarity SQL-based Group-By operator (SGB, for short) extends the semantics
of the standard SQL Group-by by grouping data with similar but not necessarily
equal values. While existing similarity-based grouping operators efficiently
realize these approximate semantics, they primarily focus on one-dimensional
attributes and treat multi-dimensional attributes independently. Consequently,
correlated attributes, such as those in spatial data, are processed separately,
and, hence, groups in the multi-dimensional space are not detected properly. To
address this problem, we introduce two new SGB operators for multi-dimensional
data. The first operator is the clique (or distance-to-all) SGB, where all the
tuples in a group are within some distance from each other. The second operator
is the distance-to-any SGB, where a tuple belongs to a group if the tuple is within
some distance from any other tuple in the group. Since a tuple may satisfy the
membership criterion of multiple groups, we introduce three different semantics
to deal with such a case: (i) eliminate the tuple,
(ii) put the tuple in any one group, and (iii) create a new group for this
tuple. We implement and test the new SGB operators and their algorithms inside PostgreSQL. The overhead introduced by these operators
proves to be minimal and the execution times are comparable to those of the
standard Group-by. The experimental study, based on TPC-H and social check-in
data, demonstrates that the proposed algorithms can achieve up to three orders
of magnitude enhancement in performance over baseline methods developed to
solve the same problem.
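The distance-to-any semantics can be sketched in a few lines of Python. This is a union-find illustration only, not the operators' PostgreSQL implementation: any two tuples within eps of each other are merged, so each connected component becomes one group.

    import math

    def sgb_distance_to_any(points, eps):
        """Group 2-D tuples: a tuple joins a group if it is within eps of
        any other tuple in the group (connected components under eps)."""
        parent = list(range(len(points)))
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]   # path halving
                i = parent[i]
            return i
        for i in range(len(points)):
            for j in range(i + 1, len(points)):
                if math.dist(points[i], points[j]) <= eps:
                    parent[find(i)] = find(j)   # union
        groups = {}
        for i in range(len(points)):
            groups.setdefault(find(i), []).append(points[i])
        return list(groups.values())

    pts = [(0, 0), (1, 0), (2, 0), (10, 10)]
    print(sgb_distance_to_any(pts, 1.5))
    # [[(0, 0), (1, 0), (2, 0)], [(10, 10)]] -- (0, 0) and (2, 0) are 2.0
    # apart yet share a group via (1, 0); the clique SGB would split them.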
The Similarity-aware Relational Database Set
Operators [Information Systems Journal, 2016] (paper)
Finding data items in
the proximity of a query item can be formulated as a similarity operation.
Identifying these similarities in large datasets is an essential operation in
several applications such as bioinformatics, pattern recognition, and data
integration. To make a relational database management system similarity-aware,
the core relational operators have to be extended. While similarity-awareness
has been introduced in database engines for relational operators such as joins
and group-by, little has been achieved for relational set operators, namely
Intersection, Difference, and Union. In this paper, we propose to extend the
semantics of relational set operators to take into account the similarity of
values. We develop efficient query processing algorithms for evaluating them,
and implement these operators inside an open-source database system, namely PostgreSQL. By extending several queries from the TPC-H
benchmark to include predicates that involve similarity-based set operators, we
perform extensive experiments that demonstrate up to three orders of magnitude
speedup in performance over equivalent queries that only employ regular
operators.
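As a minimal illustration of the extended semantics, a similarity-aware Intersection keeps a value of R if S holds some value within eps of it; exact equality is the eps = 0 special case. This one-dimensional Python sketch is illustrative only; the paper's operators run inside PostgreSQL with efficient algorithms.

    def sim_intersection(r, s, eps):
        """Values of R that have a counterpart in S within eps."""
        return [x for x in r if any(abs(x - y) <= eps for y in s)]

    print(sim_intersection([1.0, 5.0, 9.0], [1.2, 8.7], eps=0.5))
    # [1.0, 9.0] -- 5.0 has no counterpart in S within 0.5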
ORLF: A Flexible Framework for Online Record
Linkage and Fusion [IEEE ICDE 2016] (Demo
paper)
With the exponential
growth of data on the Web comes the opportunity to integrate multiple sources
to give more accurate answers to user queries. Upon retrieving records from
multiple Web databases, a key task is to merge records that refer to the same
real-world entity. We demonstrate ORLF (Online Record Linkage and Fusion), a
flexible query-time record linkage and fusion framework. ORLF de-duplicates
newly arriving query results jointly with previously processed query results.
We use an iterative caching solution that leverages query locality to
effectively de-duplicate newly incoming records with cached records. ORLF aims
to deliver timely query answers that are duplicate-free and reflect knowledge
collected from previous queries.
Approving Updates in Collaborative Databases [IEEE
IC2E 2015] (paper)
Although this research
was motivated by and was started in the context of cleaning dirty spatial data,
the techniques developed proved to be useful and generally applicable to many
application domains including scientific data curation.
Data curation activities, especially in collaborative
databases, mandate that collaborators interact until they converge and agree on
the content of their data. Typically, updates by a member of the collaboration
are made visible to all collaborators for comments but at the same time are
pending the approval or rejection of the data custodian, e.g., the principal
scientist or investigator (PI). In current database technologies, approval and
authorization of updates are based solely on the identity of the user, e.g.,
via the SQL GRANT and REVOKE commands. However, in collaborative environments,
the updated data is open for collaborators for discussion and further editing
and is finally approved or rejected by the PI based on the content of the data
and not on the identity of the updater. In this research, we introduce a
cloud-based collaborative database system that promotes and enables collaboration
and data curation scenarios. We realize content-based
update approval and history tracking of updates inside HBase,
a distributed and scalable open-source cluster-based database. The design and
implementation as well as a detailed performance study of several approaches
for update approval are investigated and alternative techniques are contrasted.
Precision-Bounded Access Control Using
Sliding-Window Query Views for Privacy-Preserving Data Streams [TKDE 2015] (paper)
This research extends
spatial indexing and spatial query processing techniques to apply in the
context of access control and preserving privacy over data streams. Access control
mechanisms and Privacy Protection Mechanisms (PPM) have been proposed for data
streams. Access control for a data stream grants a role access to the tuples
that satisfy an authorized sliding-window query predicate. Sharing sensitive
stream data without a PPM can compromise privacy. The PPM meets privacy
requirements, e.g., k-anonymity or l-diversity, by generalization of stream
data. Imprecision introduced by generalization can be reduced
by delaying the publishing of stream data. However, the delay in sharing
the stream tuples to achieve better accuracy can lead to false-negatives if the
tuples are held by the PPM while the query predicate is evaluated. The administrator of
an access control policy defines the imprecision bound for each query. The
challenge for PPM is to optimize the delay in publishing of stream data so that
the imprecision bound for the maximum number of queries is satisfied. We
formulate the precision-bounded access control for privacy-preserving data
streams problem, present the hardness results, provide an anonymization
algorithm, and conduct experimental evaluation of the proposed algorithm.
Experiments demonstrate that the proposed heuristic provides better precision
for a given data stream access control policy as compared to the minimum or
maximum delay heuristics proposed in existing literature.
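The underlying delay-versus-precision tension can be sketched as follows. This is a toy illustration, not the paper's anonymization algorithm; k and max_delay are illustrative stand-ins for the anonymity requirement and the imprecision bound.

    def anonymize(stream, k=3, max_delay=5):
        """Buffer (time, value) tuples; publish a generalized range once k
        tuples can share it, or once further delay would risk staleness."""
        buf = []
        for t, value in stream:
            buf.append((t, value))
            if len(buf) >= k or t - buf[0][0] >= max_delay:
                vals = [v for _, v in buf]
                yield (min(vals), max(vals)), len(buf)
                buf = []
        if buf:                          # end-of-stream flush; a real PPM
            vals = [v for _, v in buf]   # would suppress groups smaller than k
            yield (min(vals), max(vals)), len(buf)

    stream = [(0, 41), (1, 43), (4, 39), (9, 70)]
    for rng, n in anonymize(stream):
        print(rng, n)   # (39, 43) 3, then (70, 70) 1 at end of stream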
GeoTrend: spatial trending queries on
real-time microblogs [SIGSPATIAL 2016] (paper)
In this research, we
develop GeoTrend, a system for scalable support
of spatial trend discovery on recent microblogs,
e.g., tweets and online reviews, that arrive in real time. GeoTrend is distinguished from existing techniques in
three aspects: (1) It discovers trends in arbitrary
spatial regions, e.g., city blocks. (2) It supports trending measures that effectively
capture trending items under a variety of definitions that suit different
applications. (3) It promotes recent microblogs as
first-class citizens and optimizes its system components to digest a continuous
flow of fast data in main memory while removing old data efficiently. GeoTrend queries are top-k queries that discover
the most trending k keywords that are posted within an arbitrary
spatial region and during the last T time units. To support its
queries efficiently, GeoTrend employs an
in-memory spatial index that is able to efficiently digest incoming data and
expire data that is beyond the last T time units. The index also
materializes top-k keywords in different spatial regions so that incoming
queries can be processed with low latency. During
peak times, a main-memory optimization technique is employed to shed less
important data, so that the system still sustains high query accuracy with
limited memory resources. Experimental results based on a real Twitter feed and
Bing Mobile spatial search queries show the scalability of GeoTrend in supporting arrival rates of up to 50,000 microblogs/second, an average query latency of 3 milliseconds, and at least 90% query accuracy, even under
limited memory resources.
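A minimal sketch of the counting scheme follows. The grid, ingest, and top_k names are assumptions; GeoTrend's index, trending measures, and load-shedding machinery are far more elaborate.

    from collections import defaultdict, Counter

    grid = defaultdict(lambda: defaultdict(list))   # cell -> keyword -> times

    def ingest(x, y, keyword, t):
        grid[(int(x), int(y))][keyword].append(t)

    def top_k(cells, T, k, now):
        """Top-k keywords over the given cells within the last T time units."""
        counts = Counter()
        for cell in cells:
            for kw, times in grid[cell].items():
                c = sum(1 for t in times if now - t <= T)
                if c:
                    counts[kw] += c
        return counts.most_common(k)

    ingest(1.2, 3.4, "parade", 100); ingest(1.7, 3.1, "parade", 101)
    ingest(1.5, 3.9, "traffic", 50)              # too old for T = 10 below
    print(top_k([(1, 3)], T=10, k=2, now=105))   # [('parade', 2)]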
Atlas: On the expression of spatial-keyword group
queries using extended relational constructs [SIGSPATIAL 2016] (paper)
This research addresses
the query language and expressiveness of spatial keyword queries. The
popularity of GPS-enabled cellular devices introduced numerous applications,
e.g., social networks, micro-blogs, and crowd-powered reviews. These
applications produce large amounts of geo-tagged textual data that need to be
processed and queried. Nowadays, many complex spatio-textual
operators and their matching complex indexing structures are being proposed in
the literature to process this spatio-textual data.
For example, there exist several complex variations of the spatio-textual
group queries that retrieve groups of objects that collectively satisfy certain
spatial and textual criteria. However, having complex operators is against the
spirit of SQL and relational algebra. In contrast to these complex spatio-textual operators, in relational algebra, simple
relational operators are offered, e.g., relational selects, projects, order by,
and group by, that are composable
to form more complex queries. In this research, we introduce Atlas, an SQL
extension to express complex spatial-keyword group queries. Atlas follows the
philosophy of SQL and relational algebra in that it uses simple declarative
spatial and textual building-block operators and predicates to extend SQL. Not
only can Atlas represent spatio-textual group
queries from the literature, but it can also compose other important queries,
e.g., retrieve spatio-textual groups from subsets of
object datasets where the selected subset satisfies user-defined relational
predicates and the groups of close-by objects contain misspelled keywords. We
demonstrate that Atlas is able to represent a wide range of spatial-keyword
queries that existing indexes and algorithms would not be able to address. The
building-block paradigm adopted by Atlas creates room for query optimization,
where multiple query execution plans can be formed.
Spatial Computing [CACM 2016] (Overview paper)
Spatial computing
encompasses the ideas, solutions, tools, technologies, and systems that
transform our lives by creating a new understanding of locations—how we know,
communicate, and visualize our relationship to locations and how we navigate
through them. Pervasive GPS allows hikers in national parks, boaters on lakes,
children visiting new places, and taxis (or Uber
drivers or self-driving cars) and unmanned aerial vehicles to know their
locations, nearby facilities, and routes to reach places of interest. This
article provides an overview of the future of spatial computing and its
roles in the various aspects of our lives.
LocationSpark: A Distributed In-Memory Data
Management System for Big Spatial Data [VLDB 2016] (Demo
paper)
We present LocationSpark, a spatial data processing system built on
top of Apache Spark, a widely used distributed data processing system. LocationSpark offers a rich set of spatial
query operators, e.g., range search, kNN, spatio-textual operations, spatial-join, and kNN-join. To achieve high performance, LocationSpark
employs various spatial indexes for in-memory data, and guarantees that
immutable spatial indexes have low overhead with fault tolerance. In addition,
we build two new layers over Spark, namely a query scheduler and a query
executor. The query scheduler is responsible for mitigating skew in spatial
queries, while the query executor selects the best plan based on the indexes
and the nature of the spatial queries. Furthermore, to avoid unnecessary
network communication overhead when processing overlapped spatial data, we embed an efficient spatial Bloom filter into LocationSpark’s indexes. Finally, LocationSpark
tracks frequently accessed spatial data and dynamically flushes less
frequently accessed data to disk. We evaluate our system on real workloads
and demonstrate that it achieves an order of magnitude performance gain over a
baseline framework.
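To illustrate the spatial Bloom filter idea in isolation, here is a generic Bloom filter keyed by grid cells, not LocationSpark's actual implementation: before shipping work to a remote partition, a node can test whether that partition can contain a given cell at all (false positives are possible; false negatives are not).

    import hashlib

    class BloomFilter:
        def __init__(self, m=1024, k=3):
            self.bits, self.m, self.k = 0, m, k   # m bits, k hash functions
        def _hashes(self, item):
            for i in range(self.k):
                h = hashlib.sha1(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.m
        def add(self, item):
            for h in self._hashes(item):
                self.bits |= 1 << h
        def might_contain(self, item):
            return all(self.bits >> h & 1 for h in self._hashes(item))

    partition_filter = BloomFilter()
    partition_filter.add((4, 7))          # this partition holds grid cell (4, 7)
    print(partition_filter.might_contain((4, 7)))   # True
    print(partition_filter.might_contain((9, 9)))   # almost surely False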
In-Memory Distributed Matrix Computation Processing
and Optimization [ICDE 2017] (paper)
The use of large-scale
machine learning and data mining methods is becoming ubiquitous in many
application domains ranging from business intelligence and bioinformatics to
self-driving cars. These methods heavily rely on matrix computations, and it is
hence critical to make these computations scalable and efficient. These matrix
computations are often complex and involve multiple steps that need to be
optimized and sequenced properly for efficient execution. This research
introduces new efficient and scalable matrix processing and optimization
techniques for in-memory distributed clusters. The proposed techniques estimate
the sparsity of intermediate matrix-computation
results and optimize communication costs. An evaluation plan generator for
complex matrix computations is introduced as well as a distributed plan
optimizer that exploits dynamic cost-based analysis and rule-based heuristics
to optimize the cost of matrix computations in an in-memory distributed
environment. The result of a matrix operation will often serve as an input to
another matrix operation, thus defining the matrix data dependencies within a
matrix program. The matrix query plan generator produces query execution plans
that minimize memory usage and communication overhead by partitioning the
matrix based on the data dependencies in the execution plan. We implemented the
proposed matrix processing and optimization techniques in Spark, a distributed
in-memory computing platform. Experiments on both real and synthetic data
demonstrate that our proposed techniques achieve up to an order-of-magnitude
performance improvement over state-of-the-art distributed matrix computation
systems on a wide range of applications.
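The sparsity-estimation step can be sketched as follows, under a simplified model that assumes independently placed nonzeros; the names and the estimator are illustrative, not the paper's exact method. From the nonzero counts of A's rows and B's columns, the expected number of nonzeros in the product A @ B can be estimated before any data is shuffled, which is enough to choose between sparse and dense plans.

    def estimated_nnz(a_row_nnz, b_col_nnz, n_inner):
        """Expected nonzeros of A @ B from per-row/per-column nnz counts."""
        total = 0.0
        for ra in a_row_nnz:                # nonzeros in each row of A
            for cb in b_col_nnz:            # nonzeros in each column of B
                p_zero = (1 - (ra / n_inner) * (cb / n_inner)) ** n_inner
                total += 1 - p_zero         # P[product entry is nonzero]
        return total

    # 100x100 inputs with ~2 nonzeros per row/column: roughly 392 of the
    # 10,000 product entries are expected to be nonzero, so a sparse plan
    # (and a cheaper shuffle) can be chosen before computing anything.
    print(round(estimated_nnz([2] * 100, [2] * 100, 100)))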
Query Processing Techniques for Big Spatial-Keyword
Data [SIGMOD Tutorial 2017] (paper)
(slides)
The widespread use of
GPS-enabled cellular devices, i.e., smart phones, has led to the popularity of
numerous mobile applications, e.g., social networks, micro-blogs, mobile web
search, and crowd-powered reviews. These applications generate large amounts of
geo-tagged textual data, i.e., spatial-keyword data. This data needs to be
processed and queried at an unprecedented scale. The management of
spatial-keyword data at this scale goes beyond the capabilities of centralized
systems. We live in the era of big data, and the big data model is currently
being used to address scalability issues in various application domains. This
has led to the development of various big spatial-keyword processing systems.
These systems are designed to ingest, store, index, and query huge amounts
of spatial-keyword data. In this 1.5-hour
tutorial, we explore recent research efforts in the area of big spatial-keyword
processing. First, we give the main motivations behind big spatial-keyword systems
along with real-life applications. We describe the main models for big
spatial-keyword processing, and list the popular spatial-keyword queries. Then,
we present the approaches that have been adopted in big spatial-keyword
processing systems with special attention to data indexing and spatial and
keyword data partitioning. Finally, we conclude this tutorial with a discussion
on some of the open problems and research directions in the area of big
spatial-keyword query processing.
A Survey of Shortest-Path Algorithms [CoRR 2017] (paper)
A shortest-path algorithm
finds a minimal-cost path between two vertices in a graph. A
plethora of shortest-path algorithms has been studied in the literature, spanning
multiple disciplines. This research surveys shortest-path algorithms
based on a taxonomy that we introduce. One dimension of this taxonomy is the
various flavors of the shortest-path problem. There is no one general algorithm
that is capable of solving all variants of the shortest-path problem due to the
space and time complexities associated with each algorithm. Other important
dimensions of the taxonomy include whether the shortest-path algorithm operates
over a static or a dynamic graph, whether the
shortest-path algorithm produces exact or approximate answers, and whether the
objective of the shortest-path algorithm is to handle time-dependence or only
to be goal-directed. This survey studies and classifies shortest-path
algorithms according to the proposed taxonomy. The survey also presents the
challenges and proposed solutions associated with each category in the
taxonomy.
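For reference, the classical exact, static, non-negative-weight baseline that anchors this taxonomy is Dijkstra's algorithm; a compact Python version follows.

    import heapq

    def dijkstra(graph, source, target):
        """Minimal-cost path length in a graph with non-negative weights."""
        dist, heap = {source: 0}, [(0, source)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == target:
                return d
            if d > dist.get(u, float("inf")):
                continue                    # stale heap entry
            for v, w in graph.get(u, []):
                if d + w < dist.get(v, float("inf")):
                    dist[v] = d + w
                    heapq.heappush(heap, (d + w, v))
        return float("inf")

    g = {"a": [("b", 2), ("c", 5)], "b": [("c", 1)]}
    print(dijkstra(g, "a", "c"))   # 3: the path a -> b -> c beats a -> c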
Teaching,
Education, Training, and Development
2011-2012
Education and Course Design:
In Fall 2011, Aref offered
a graduate-level seminar class, CS698. The purpose of this course is to train
students in spatiotemporal data management research. Students learned the basic
and advanced techniques in spatial and spatiotemporal data management through
several interesting projects. The main class project is to design an
introductory-level (CS-I) programming course based on
spatiotemporal data management techniques, e.g., using 2D and 3D map and moving
objects’ operations. More detail on the CS Programming-I course is given below.
In the class, students covered some of the fundamental papers in spatial and
spatiotemporal data management as well as the useful tools that facilitate
spatial and spatiotemporal data systems research, e.g., data generators,
benchmarks, open source systems, extensible indexing architectures, etc. This
course counts as one of the two CS-698 research courses a Ph.D. student can
take in his/her first three semesters or as a CS590 independent study course.
The class met once a week on Wednesdays. Each student was expected to give
around two to three presentations over the span of the semester. In addition to
the CS Programming-I course project that every student was expected to
participate in, each student, or a team of one to three
students, selected a semester-long project. Each student (or each group) was
expected to meet with Walid on an individual basis once a week to follow up on the
project’s progress. During class, group discussions revolved around the CS
Programming-I Project following some students' presentations.
CS Programming-I Project:
This part of the class
involved lots of brainstorming and design activities. The purpose of this
course project is to design an introductory programming course in computer
science to address the low retention of computer science freshmen and their
diminished interest in science. The target is to go beyond the classical
'Hello World' programming examples that have shown their limits. The goal of
this project is to design the CS Programming I course around 2D and 3D map
operations (e.g., using 3D Earth and 2D Map APIs) to increase the interest of
students in programming, both in CS and in Science. According to CRA reports,
the interest among undergraduate students in majoring in Computer Science has
decreased significantly. Universities nationwide have been experiencing a
noticeable decline in freshman registration in CS for the past years. For
example, the number of new CS majors in Fall 2007 was half of what it was in
Fall 2000 (15,958 versus 7,915). Several colleges attribute this decline to the
increase in computer literacy among incoming students and the boring nature of
the freshman Programming I courses. A first program that prints 'Hello World!'
on the screen is no longer thrilling to students who have been playing computer games with
high-quality graphics since kindergarten. To add excitement, several colleges have
redesigned their freshman Programming I courses to use toy devices, e.g.,
Robots. The target would be to program the robot to perform certain tasks while
learning various programming concepts in the process. Although interesting to
some students, using robots has several disadvantages. The high setup cost of
obtaining and maintaining the robots is one setback. Another disadvantage is
the frustration that many students have when dealing with hardware as well as
software debugging issues simultaneously. The goal of this project is to design
a CS Programming-I course that uses 2D and 3D maps (e.g., Google Earth and
Google Maps) as the ‘toy' through which freshman CS students learn programming
concepts. Besides the low setup cost, learning using 2D and 3D maps adds the
excitement needed by introducing high-quality graphics and high interactivity
by relating the programming tasks to maps or photos of locations that the
students can relate to. Wrapper APIs need to be developed to simplify the
setup and interaction with the map software. The new CS Programming-I course
needs to be developed along the same lines as the first CS programming course
at Purdue (CS-180).
Education and Course Design Progress:
11 graduate students (8
males and 3 females) attended the CS698 class that Aref offered in Fall 2011.
Topics covered in class included:
- Dissection of Purdue’s Existing CS-177 & CS-180 introductory CS Programming courses and their exercises.
- Study of 2D and 3D Map APIs (e.g., Google’s and Microsoft products, among others)
- Study of spatial and spatiotemporal data operations
- What are the basic, minimal, and complete sets of spatiotemporal operations?
- Study of Map and GIS (Geographic Information Systems) operations
- Classic multi-dimensional (spatial) and spatiotemporal data indexing techniques
- Spatial and spatiotemporal query processing algorithms
- Spatial and spatiotemporal query optimization
- Spatial and spatiotemporal data management systems
- Spatiotemporal data streaming systems
- Study of publicly available spatial and spatiotemporal data sets
- Synthetic spatiotemporal data generators
- Benchmarking spatial and spatiotemporal data systems
Student projects in the
course included: spatiotemporal query optimization, a study of publicly
available spatial and spatiotemporal data sets (real and synthetic), parallel
spatiotemporal data indexing, distance oracles, map matching, and a study of
spatiotemporal data generators. Several 2D- and 3D-based programming project
ideas and programming primitives were developed for the CS Programming I
course. These ideas need further elaboration and are currently being studied by
Walid and his group.
Training and Development:
Students have gained
research experience in the context of this project. They have published their
results in rigorous conference outlets, e.g., VLDB, which is one of the two most
prestigious and competitive publication venues in the field of database
systems (along with the ACM SIGMOD Conference). Mr. Aly
got his paper accepted into VLDB 2012, which was an excellent experience for
him as a first publication in his career. During the graduate seminar course
(CS698) that Aref offered in Fall 2011, students (8 males and 3 females)
benefitted tremendously: through the class activities, students gave many
class presentations about fundamental ideas and papers in the field
of spatial database systems, as well as covering progress on their class projects.
Also, students learned through this experience how to critique the ideas of others
in a balanced fashion, whether the published work of others or the work of their
colleagues in the class. Through this class experience, several students got
started in their Ph.D. research, including Mr. Mahmood
and Mr. Tang. Students have also gained tremendous experience working with
large open-source software systems, e.g., Hadoop. Our
M3 project has exposed them to invaluable software development experience while
adapting Hadoop by eliminating all the disk layers
from it and turning it into a massively distributed and parallel data streaming
system. Mr. Aly was later invited to spend the summer
at Microsoft Research where he sharpened his research skills further.
2012-2013
Education and Course Design
Aref has offered over 15
graduate research independent study courses (CS590, CS698, CS699) in Fall 2012
and Spring 2013. The purpose of these courses is to train graduate students
individually or in small groups in various topics in spatiotemporal data
management research. Students learn advanced topics in spatial and
spatiotemporal data management through several interesting independent study or
group projects. Aref currently has 11 Ph.D. students (10 in CS, 1 in ECE), three
of whom were supported by this grant, and one additional woman Ph.D. student
(Ruby Tahboub) will be supported by this grant as of
January 2014. One of the targets of this project's education plan is to design
an introductory programming course in computer science to address low retention
of computer science freshman year and their diminished interest in science. The
target is to go beyond the classical 'Hello World' programming examples that
are no longer thrilling to students who have been playing computer games with high-quality
graphics since kindergarten. Our goal is to design a CS Programming-I course that uses 2D
and 3D maps (e.g., Google Earth and Google Maps) as the 'toy' through which
freshman CS students learn programming concepts in contrast to using robots as
other programming environments do. Aref and one Ph.D. student have been working
on developing this introductory-level (CS-I) programming course based on
spatiotemporal data management techniques, e.g., using 2D and 3D map and moving
objects' operations. We have focused on studying several existing introductory
programming environments, e.g., Scratch and Alice. We
have started developing 2D map-based programming primitives and abstract
operations that serve as building blocks for the map-based programming language
to be developed. Currently, we are using these primitives to develop some
simple programming exercises to serve as a way to debug these primitives. Once
the design of these map primitives and programming abstractions converges and
is finalized, we will start building these primitives for testing.
Training and Development
Students have
gained research experience in the context of this project. They have published
their results in rigorous conference and journal outlets, e.g., the VLDB
Journal, the IEEE Transactions on Knowledge and Data Engineering, the GeoInformatica Journal, the IEEE BigData
Conference, and the VLDB DBCrowd workshop. Many of
the participating students gained excellent experience, as these were the first publications of their careers. During the graduate
independent study courses, Ph.D. students (10 males and 1 female) have
benefitted from the various class activities, and many of them got
"jumpstarted" into spatiotemporal data systems research through
systems-oriented projects that resulted in conference and journal paper
submissions.
Several students,
e.g., Mr. Madkour and Mr. Mahmood
(both were supported under this project in the 2012/2013 academic year), got
their first research papers accepted into the IEEE BigData
2013 and VLDB DBCrowd 2013, respectively. Students
have also gained experience working with large open-source software systems,
e.g., Hadoop and PostgreSQL.
Because of this experience, four of the Ph.D. students got invited for summer
internships at various industrial and research labs.
Post-doctoral researcher Dr. Shen Zhihong from the Chinese
Academy of Sciences visited Purdue, at his government's expense, to collaborate with Aref’s
research group and receive training in the context of
this project. Also, Dr. Saleh Basalamah,
Director of the GIS Technology Innovation Center, Umm Al-Qura
University, Saudi Arabia, spent an academic year at Purdue to collaborate
in the context of this project. Dr. Basalamah is a
co-author of the DBCrowd 2013 and IEEE BigData 2013 conference papers.
Dissemination of Results to Communities of Interest
and Outreach Activities:
In 2012/2013, Aref
offered over 15 independent graduate research courses, which introduced
the topics in the project to 11 Ph.D. graduate students during the fall and
spring semesters of 2012/2013. Walid served as an organizing committee member of
the Spatial Computing 2020 Workshop sponsored by NSF/CCC to promote a unified
agenda for Spatial Computing research
(http://cra.org/ccc/visioning/visioning-activities/spatial-computing/196-spatial-computing-workshop),
where Walid was the lead for the Systems track in the workshop. This has
resulted in an article that is currently being submitted to CACM in which Walid
is a co-author. In December 2012, Walid gave a keynote speech at the
International Conference for Connected Vehicles and Expo (Beijing, China) to
reflect advances in spatiotemporal database systems research. Walid continues
to chair the ACM SIGSPATIAL special interest group on spatial algorithms and
systems (2011-2014), where he was one of four founding members of the SIG.
2013-2014
Education and Course Design
Aref
has offered over 20 graduate and undergraduate research independent study and
theses courses (CS490, CS590, CS698, CS699) in Fall 2013 and Spring and Summer 2014. The purpose of these courses is to
train students individually or in small groups in various topics in
spatiotemporal data management research. Students learn advanced topics in
spatial and spatiotemporal data management through several interesting independent
study or group projects including distributed and parallel spatiotemporal data
streaming, benchmarking of spatiotemporal databases, declustered
spatiotemporal query processing, among many other topics. These studies are
expected to result in conference papers in the
coming academic year. Aref currently has one Masters student and 12 Ph.D.
students (11 in CS, 1 in ECE), three of whom were supported by this grant,
including one woman Ph.D. student (Ruby Tahboub).
Another woman Ph.D. student is scheduled to join this project in Spring 2015.
Undergraduate Senior Student Linda Thongsavath did
her senior design project on the topic of benchmarking spatiotemporal databases
with deep query evaluation pipelines.
The computer science
education community has invested tremendous resources in raising interest in
computing and improving retention. A substantial share of these efforts relies on
developing innovative programming environments that target young age
groups. Along this line, we introduce LIMO, a web-based environment for learning
programming using activities related to interactive two-dimensional
geographical maps and the operations of moving objects on the maps. An
essential aspect of LIMO is that it provides diverse programming constructs
that enable users to process, customize, and visualize location data on the map.
In our view, the integration of interactive maps, location-based data, and
delivering seamless spatial constructs within an educational programming
environment is a very promising direction. We plan to develop LIMO further with
additional spatial and visual constructs to make it more attractive to novice
programmers. We plan to conduct a usability study that involves novice users to
assess the efficacy of using LIMO while learning programming. Finally, we plan
to develop a curriculum for an introductory computer science course based on
LIMO.
Aref
and two Ph.D. students have continued to work on developing LIMO based on
spatiotemporal data management techniques using 2D and 3D map and moving
objects' operations. In 2013/2014, we have developed 2D and 3D map-based
programming primitives and abstract operations around Google Maps and Google
Earth. These 2D and 3D map primitives serve as building blocks for the
map-based programming language that we are currently developing. Currently, we
are using these primitives to develop some simple programming exercises to
serve as a way to debug these primitives. To enrich this map-based programming
environment, we integrated an open-source map database (namely, OpenStreetMap) underneath the building blocks, so that the building
blocks can refer to the underlying spatial as well as the textual descriptions
of the spatial objects (e.g., rivers, lakes, shopping centers, etc.). The
programming exercises that we developed gave rise to interesting new research
problems that Ph.D. student Ruby Tahboub is currently
studying.
2014-2015:
In
2014/2015, Walid's research group was composed
of 17 Ph.D. students covering 5 continents (except for Australia :-) and including four
female students (covering 4 continents). In addition, over the past year, Walid
worked with 15 undergraduate students, three of whom are female, who
participated in big spatial data projects for their senior design project work.
Two of the undergraduate students co-authored two papers with Walid, and one
systems development paper on supporting big spatial data over Hadoop is in the works.
One
student (A. Aly) from Walid's
group graduated with his Ph.D. and joined Google as of August 2015.
One Ph.D. student passed his Ph.D. prelim exam and five other students will
have their prelim exam during this academic year.
2015-2016:
--
The second Ph.D. student (Mingjie Tang) has graduated
based on the research conducted within the scope of this project.
-- Over the
course of this project, Aref has offered over 50 graduate and undergraduate
research independent study and theses courses (CS390, CS490, CS590, CS698,
CS699). Around 20% of these course offerings were for female students. The
purpose of these courses was to train students individually or in small groups
in various topics in spatiotemporal data management research. Students learned
advanced topics in spatial and spatiotemporal data management through several
interesting independent study or group projects including distributed and
parallel spatiotemporal data streaming, benchmarking of spatiotemporal
databases, declustered spatiotemporal query
processing, among many other topics. Some of these studies resulted
in conference and journal papers.
2016-2017:
We have
maintained and released the website for the LIMO system. LIMO
Website: http://ibnkhaldun.cs.purdue.edu:8181/limo/GWTMaps.html
We have
introduced the notion of user accounts in LIMO so that users can log in
remotely through the web interface and write and store their LIMO programs on
our LIMO server.
We have published 16
journal papers in top-tier journal venues including the IEEE Transactions on
Knowledge and Data Engineering, The VLDB Journal, Communications of the ACM, GeoInformatica, Information Systems, the Proceedings of the
VLDB Endowment, and the IEEE Software magazine, as well as 21 papers in top-tier
conference venues including ACM SIGMOD, VLDB, IEEE ICDE, EDBT, ACM SIGSPATIAL,
SISAP, and WSDM, and one upcoming tutorial at the ACM SIGMOD (on the subject of
Spatial-keyword Query Processing Techniques).
As a way of disseminating
the results to communities of interest, Walid and his Ph.D. student gave a
tutorial on the subject of Spatial Keyword Big Data Processing at the 2017
SIGMOD Conference based on the research conducted in this project. Moreover, Walid
gave a keynote speech at the 2017 SIGMOD GeoRich Workshop on the subject of "Location+X Big Data Systems: Challenges and Some
Solutions" based on the research conducted in this project.
Publications
13. A.M. Aly, W.G. Aref, M. Ouzzani, Cost
Estimation of Spatial k-Nearest-Neighbor Operators, EDBT 2015.
Collaborators:
· Dr. Paul Larson: Microsoft Research, Redmond, WA, collaborated in writing a research
paper [VLDBJ2013] that provides foundational research for the correctness ideas
in this grant.
· Dr. M. Ouzzani: QCRI, was originally a co-PI in
the project when he was at Purdue and withdrew as co-PI after joining QCRI, but
continued to collaborate extensively in the context of this project
co-authoring many of the papers and co-advising one of the Ph.D. students (A.
Aly).
· Dr. M.H. Ali:
Microsoft Corporation, Redmond, Washington, collaborated in writing a research
paper [VLDBJ2013] that provides foundational research for the correctness ideas
in this grant.
· Dr. Y. Silva:
Arizona State University, collaborated in writing several papers in the context
of this project related to similarity query processing [VLDBJ 2013 and TKDE
2016].
· Dr. Hicham Elmongui: Amazon, collaborated in writing
a research paper [GeoInformatica2013] that presents algorithms for efficiently
evaluating aggregate nearest-neighbor queries. Now at U. Alexandria, Egypt.
· Prof. M. Eltabakh: WPI, Massachusetts, collaborated
in writing a research paper [IEEE TKDE 2014] that presents models and
techniques for integrating human activities into database systems.
· Dr. Ahmed Elmagarmid: QCRI,
collaborated in writing a demo paper related to data cleaning [ICDE 2016].
· Prof. Mikhail Atallah: Purdue CS, collaborated in
writing a research paper [TKDE 2016].
· Prof. M. Gribskov: Purdue Biology, collaborated in
writing a research paper [TKDE 2016], providing real-world
driving applications from biology along with relevant datasets.
· Graduate Students: A. Aly (Ph.D., 2015, Now at
Google), M. Tang (Ph.D., 2016, Now at Hortonworks),
A. Mahmood (Ph.D. student), R. Tahboub
(Ph.D. student), A. Madkour (Ph.D. student), M.
Hassan (Ph.D. student) were funded at different periods under this grant and
worked on various aspects of this project.
· S. Pearson:
Graduate student at Arizona State who collaborated in the context of
generalizing the results in this project for similarity-based databases. Mr.
Pearson conducted experiments requested by the reviewers in order to revise the
submitted paper that was later accepted to appear in the VLDB Journal
[VLDBJ2013]. Now at Purdue.
· Aya A. Ismail: Graduate student at Maryland,
College Park. Collaborated in this project in the context of the LIMO system.
· B. Coffey:
Graduate student at Purdue University who collaborated in the context of
collecting meta-data about publicly available spatial data sets.
· Undergraduate Students: Linda Thongsavath: Undergraduate student at Purdue University
who collaborated in the context of realizing a benchmark for spatiotemporal
databases that use k-nearest neighbors in multi-predicate queries that include
relational, spatial, and kNN predicates.
· Visitors and Post-docs:
o
Dr. Shen
Zhihong: Chinese Academy of Sciences, Beijing, China (August 2013-
October 2013) participated in designing cluster-based user workspaces.
o
Dr. Saleh
Basalamah: Director of the GIS Technology Innovation Center, Umm
Al-Qura University, Saudi Arabia (April
2013-Present), participated in coauthoring multiple research papers, including
[DBCrowd2013], [IEEE BigData 2013], and [ACM
SIGSPATIAL 2014].
o
Dr. Eduard Dragut: Dr. Dragut
was a post-doc at Purdue University during 2012-2013 and participated in the
context of this project in the crowdsourcing indexing paper [DBCrowd 2013].