VLD Digital


  • White Papers // Mar 2014

    Towards Building Wind Tunnels for Data Center Design

    Data center design is a tedious and expensive process. Recently, this process has become even more challenging as users of cloud services expect to have guaranteed levels of availability, durability and performance. A new challenge for the service providers is to find the most cost-effective data center design and configuration...

    Provided By VLD Digital

  • White Papers // Mar 2014

    A Principled Approach to Bridging the Gap between Graph Data and their Schemas

    Although RDF graph data often come with an associated schema, recent studies have proven that real RDF data rarely conform to their perceived schemas. Since a number of data management decisions, including storage layouts, indexing, and efficient query processing, use schemas to guide the decision making, it is imperative to...

    Provided By VLD Digital

  • White Papers // Mar 2014

    String Similarity Joins: An Experimental Evaluation

    String similarity join is an important operation in data integration and cleansing that finds similar string pairs from two collections of strings. More than ten algorithms have been proposed to address this problem in the last two decades. However, existing algorithms have not been thoroughly compared under the same experimental...

    Provided By VLD Digital

  • White Papers // Mar 2014

    An Efficient Publish/Subscribe Index for E-Commerce Databases

    Many of today's publish/subscribe (pub/sub) systems have been designed to cope with a large volume of subscriptions and a high event arrival rate (velocity). However, in many novel applications (such as e-commerce), there is an increasing variety of items, each with different attributes. This leads to a very high-dimensional and sparse...

    Provided By VLD Digital

  • White Papers // Mar 2014

    Calibrating Data to Sensitivity in Private Data Analysis

    The authors present an approach to differentially private computation in which one does not scale up the magnitude of noise for challenging queries, but rather scales down the contributions of challenging records. While scaling down all records uniformly is equivalent to scaling up the noise magnitude, they show that scaling...

    Provided By VLD Digital

  • White Papers // Feb 2014

    Rank Join Queries in NoSQL Databases

    Cloud stores have become the storage of choice for a large variety of big data producers, consumers, and managers (e.g., Twitter, Facebook, Google, and Amazon). For many modern Big Data applications, RDBMSs were found lacking, particularly with respect to scalability (in terms of number of data items, users, operations per...

    Provided By VLD Digital

  • White Papers // Feb 2014

    Lightweight Indexing of Observational Data in Log-Structured Storage

    Huge amounts of data are being generated by sensing devices every day, recording the status of objects and the environment. Such observational data is widely used in scientific research. As the capabilities of sensors keep improving, the data produced are drastically expanding in precision and quantity, making it a write-intensive...

    Provided By VLD Digital

  • White Papers // Feb 2014

    GRAMI: Frequent Subgraph and Pattern Mining in a Single Large Graph

    Mining frequent subgraphs is an important operation on graphs; it is defined as finding all subgraphs that appear frequently in a database according to a given frequency threshold. Most existing work assumes a database of many small graphs, but modern applications, such as social networks, citation graphs, or protein-protein interactions...

    Provided By VLD Digital

  • White Papers // Feb 2014

    Hybrid Parallelization Strategies for Large-Scale Machine Learning in SystemML

    Large-scale data analytics have become an integral part of online services, enterprise data management, system management, and scientific applications in order to gain value from huge amounts of collected data. Finding interesting unknown facts and patterns often requires analyzing the full data set instead of applying sampling techniques. Recent approaches...

    Provided By VLD Digital

  • White Papers // Feb 2014

    epiC: an Extensible and Scalable System for Processing Big Data

    The big data problem is characterized by the so-called 3V features: Volume - a huge amount of data, Velocity - a high data ingestion rate, and Variety - a mix of structured data, semi-structured data, and unstructured data. The state-of-the-art solutions to the big data problem are largely based...

    Provided By VLD Digital

  • White Papers // Feb 2014

    Optimizing Graph Algorithms on Pregel-Like Systems

    The authors study the problem of implementing graph algorithms efficiently on Pregel-like systems, which can be surprisingly challenging. Standard graph algorithms in this setting can incur unnecessary inefficiencies such as slow convergence or high communication or computation cost, typically due to structural properties of the input graphs such as large...

    Provided By VLD Digital

  • White Papers // Feb 2014

    Schemaless and Structureless Graph Querying

    Querying complex graph databases such as knowledge graphs is a challenging task for non-professional users. Due to their complex schemas and variational information descriptions, it becomes very hard for users to formulate a query that can be properly processed by the existing systems. The authors argue that for a user-friendly...

    Provided By VLD Digital

  • White Papers // Feb 2014

    Toward Computational Fact-Checking

    In this paper, the authors have shown how to turn fact-checking into a computational problem. Interestingly, by regarding claims as queries with parameters, they can check them - not just for correctness, but more importantly, for more subtle measures of quality - by perturbing their parameters. This observation leads the...

    Provided By VLD Digital

  • White Papers // Jan 2014

    Shared Workload Optimization

    As a result of increases in both the query load and the data managed, as well as changes in hardware architecture (multi-core), recent years have seen a shift from query-at-a-time approaches towards Shared Work (SW) systems where queries are executed in groups. Such groups share operators like scans and...

    Provided By VLD Digital

  • White Papers // Jan 2014

    Support the Data Enthusiast: Challenges for Next-Generation Data-Analysis Systems

    The authors present a vision of next-generation visual analytics services. They argue that these services should have three related capabilities: support visual and interactive data exploration as they do today, but also suggest relevant data to enrich visualizations, and facilitate the integration and cleaning of that data. Most importantly, they...

    Provided By VLD Digital

  • White Papers // Jan 2014

    A Provenance Framework for Data-Dependent Process Analysis

    A Data-Dependent Process (DDP) models an application whose control flow is guided by a finite state machine, as well as by the state of an underlying database. DDPs are commonly found, e.g., in e-commerce. In this paper, the authors develop a framework supporting the use of provenance in static (temporal)...

    Provided By VLD Digital

  • White Papers // Jan 2014

    Tracking Entities in the Dynamic World: A Fast Algorithm for Matching Temporal Records

    Identifying records referring to the same real-world entity over time enables longitudinal data analysis. However, difficulties arise from the dynamic nature of the world: the entities described by a temporal data set often evolve their states over time. While the state-of-the-art approach to temporal entity matching...

    Provided By VLD Digital

  • White Papers // Jan 2014

    Edelweiss: Automatic Storage Reclamation for Distributed Programming

    Event Log Exchange (ELE) is a common programming pattern based on immutable state and messaging. ELE sidesteps traditional challenges in distributed consistency, at the expense of introducing new challenges in designing space reclamation protocols to avoid consuming unbounded storage. The authors introduce Edelweiss, a sublanguage of Bloom that provides an...

    Provided By VLD Digital

  • White Papers // Dec 2013

    Exemplar Queries: Give me an Example of What You Need

    Search engines are continuously employing advanced techniques that aim to capture user intentions and provide results that go beyond the data that simply satisfy the query conditions. Examples include personalized results, related searches, similarity search, and popular and relaxed queries. In this paper, the authors introduce a novel query paradigm...

    Provided By VLD Digital

  • White Papers // Dec 2013

    Reverse Top-k Search using Random Walk with Restart

    With the increasing popularity of social networks, large volumes of graph data are becoming available. Large graphs are also derived by structure extraction from relational, text, or scientific data (e.g., relational tuple networks, citation graphs, ontology networks, protein-protein interaction graphs). Node-to-node proximity is the key building block for many graph-based...

    Provided By VLD Digital

  • White Papers // Dec 2013

    Write-limited Sorts and Joins for Persistent Memory

    To mitigate the impact of the widening gap between the memory needs of CPUs and what standard memory technology can deliver, system architects have introduced a new class of memory technology termed persistent memory. Persistent memory is byte addressable, but exhibits asymmetric I/O: writes are typically one order of magnitude...

    Provided By VLD Digital

  • White Papers // Dec 2013

    MaaT: Effective and Scalable Coordination of Distributed Transactions in the Cloud

    The past decade has witnessed an increasing adoption of cloud database technology, which provides better scalability, availability, and fault-tolerance via transparent partitioning and replication, and automatic load balancing and fail-over. However, only a small number of cloud databases provide strong consistency guarantees for distributed transactions, despite decades of research on...

    Provided By VLD Digital

  • White Papers // Dec 2013

    A Data and Workload-Aware Algorithm for Range Queries Under Differential Privacy

    The authors describe a new algorithm for answering a given set of range queries under differential privacy which often achieves substantially lower error than competing methods. Their algorithm satisfies differential privacy by adding noise that is adapted to the input data and to the given query set. They first privately...

    Provided By VLD Digital

  • White Papers // Dec 2013

    Certain Query Answering in Partially Consistent Databases

    A database is called uncertain if two or more tuples of the same relation are allowed to agree on their primary key. Intuitively, such tuples act as alternatives for each other. A repair (or possible world) of such an uncertain database is obtained by selecting a maximal number of tuples without...

    Provided By VLD Digital

  • White Papers // Dec 2013

    Computing k-Regret Minimizing Sets

    Regret minimizing sets are a recent approach to representing a dataset D by a small subset R of size r of representative data points. The set R is chosen such that executing any top-1 query on R rather than D is minimally perceptible to any user. However, such a subset...

    Provided By VLD Digital

  • White Papers // Dec 2013

    Folk-IS: Opportunistic Data Services in Least Developed Countries

    According to a wide range of studies, IT should become a key facilitator in establishing primary education, reducing mortality and supporting commercial initiatives in Least Developed Countries (LDCs). The main barrier to the development of IT services in these regions is not only the lack of communication facilities, but also...

    Provided By VLD Digital

  • White Papers // Nov 2013

    High Performance Stream Query Processing With Correlation-Aware Partitioning

    State-of-the-art optimizers produce one single optimal query plan for all stream data, in spite of such a singleton plan typically being sub-optimal or even poor for highly correlated data. Recently a new stream processing paradigm, called multi-route approach, has emerged as a promising approach for tackling this problem. Multi-route first...

    Provided By VLD Digital

  • White Papers // Nov 2013

    Scalable Discovery of Unique Column Combinations

    The discovery of all unique (and non-unique) column combinations in a given dataset is at the core of any data profiling effort. The results are useful for a large number of areas of data management, such as anomaly detection, data integration, data modeling, duplicate detection, indexing, and query optimization....

    Provided By VLD Digital

  • White Papers // Nov 2013

    Delta: Scalable Data Dissemination Under Capacity Constraints

    In content-based publish-subscribe (pub/sub) systems, users express their interests as queries over a stream of publications. Scaling up content-based pub/sub to very large numbers of subscriptions is challenging: users are interested in low latency, that is, getting subscription results fast, while the pub/sub system provider is mostly interested in scaling,...

    Provided By VLD Digital

  • White Papers // Nov 2013

    Optimization for Iterative Queries on MapReduce

    The authors propose OptIQ, a query optimization approach for iterative queries in distributed environment. OptIQ removes redundant computations among different iterations by extending the traditional techniques of view materialization and incremental view evaluation. First, OptIQ decomposes iterative queries into invariant and variant views, and materializes the former view. Redundant computations...

    Provided By VLD Digital

  • White Papers // Nov 2013

    OLTP-Bench: An Extensible Testbed for Benchmarking Relational Databases

    Benchmarking is an essential aspect of any DataBase Management System (DBMS) effort. Despite several recent advancements, such as pre-configured cloud database images and DataBase-as-a-Service (DBaaS) offerings, the deployment of a comprehensive testing platform with a diverse set of datasets and workloads is still far from being trivial. In many cases,...

    Provided By VLD Digital

  • White Papers // Nov 2013

    Gestural Query Specification

    Direct, ad-hoc interaction with databases has typically been performed over console-oriented conversational interfaces using query languages such as SQL. With the rise in popularity of gestural user interfaces and computing devices that use gestures as their exclusive modes of interaction, database query interfaces require a fundamental rethinking to work without...

    Provided By VLD Digital

  • White Papers // Nov 2013

    SeeDB: Visualizing Database Queries Efficiently

    Data scientists rely on visualizations to interpret the data returned by queries, but finding the right visualization remains a manual task that is often laborious. The authors propose a DBMS that partially automates the task of finding the right visualizations for a query. In a nutshell, given an input query...

    Provided By VLD Digital

  • White Papers // Oct 2013

    A Partition-Based Approach to Structure Similarity Search

    Graphs are widely used to model complex data in many applications, such as bioinformatics, chemistry, social networks, and pattern recognition. A fundamental and critical query primitive is to efficiently search similar structures in a large collection of graphs. This paper studies graph similarity queries with edit distance constraints. Existing...

    Provided By VLD Digital

  • White Papers // Oct 2013

    From "Think Like a Vertex" to "Think Like a Graph"

    To meet the challenge of processing rapidly growing graph and network data created by modern applications, a number of distributed graph processing systems have emerged, such as Pregel and GraphLab. All these systems divide input graphs into partitions, and employ a "Think like a vertex" programming model to support iterative...

    Provided By VLD Digital

  • White Papers // Oct 2013

    Online Ordering of Overlapping Data Sources

    Data integration systems offer a uniform interface for querying a large number of autonomous and heterogeneous data sources. Ideally, answers are returned as sources are queried and the answer list is updated as more answers arrive. Choosing a good ordering in which the sources are queried is critical for increasing...

    Provided By VLD Digital

  • White Papers // Oct 2013

    Multi-Query Optimization in MapReduce Framework

    MapReduce has recently emerged as a new paradigm for large-scale data analysis due to its high scalability, fine-grained fault tolerance and easy programming model. Since different jobs often share similar work (e.g., several jobs scan the same input file or produce the same map output), there are many opportunities to...

    Provided By VLD Digital

  • White Papers // Oct 2013

    Attraction and Avoidance Detection from Movements

    With the development of positioning technology, movement data has become widely available nowadays. An important task in movement data analysis is to mine the relationships among moving objects based on their spatiotemporal interactions. Among all relationship types, attraction and avoidance are arguably the most natural ones. However, rather surprisingly, there...

    Provided By VLD Digital

  • White Papers // Oct 2013

    Highly Available Transactions: Virtues and Limitations

    To minimize network latency and remain online during server failures and network partitions, many modern distributed data storage systems eschew transactional functionality, which provides strong semantic guarantees for groups of multiple operations over multiple data items. In this work, the authors consider the problem of providing Highly Available Transactions (HATs):...

    Provided By VLD Digital

  • White Papers // Oct 2013

    Probabilistic Nearest Neighbor Queries on Uncertain Moving Object Trajectories

    Nearest Neighbor (NN) queries in trajectory databases have received significant attention in the past, due to their applications in spatiotemporal data analysis. More recent work has considered the realistic case where the trajectories are uncertain; however, only simple uncertainty models have been proposed, which do not allow for accurate probabilistic...

    Provided By VLD Digital

  • White Papers // Aug 2013

    Hadoop's Adolescence

    The authors analyze Hadoop workloads from three different research clusters from a user-centric perspective. The goal is to better understand data scientists' use of the system and how well the use of the system matches its design. Their analysis suggests that Hadoop usage is still in its adolescence. They see...

    Provided By VLD Digital

  • White Papers // Sep 2010

    FlashStore: High Throughput Persistent Key-Value Store

    The authors present FlashStore, a high throughput persistent key value store that uses flash memory as a non-volatile cache between RAM and hard disk. FlashStore is designed to store the working set of key-value pairs on flash and use one flash read per key lookup. As the working set changes...

    Provided By VLD Digital

  • White Papers // Aug 2013

    Probabilistic Query Rewriting for Efficient and Effective Keyword Search on Graph Data

    The problem of rewriting keyword search queries on graph data has been studied recently, where the main goal is to clean user queries by rewriting keywords as valid tokens appearing in the data and grouping them into meaningful segments. The main solution to this problem employs heuristics for ranking query...

    Provided By VLD Digital

  • White Papers // Aug 2013

    Understanding Insights into the Basic Structure and Essential Issues of Table Placement Methods in Clusters

    A table placement method is a critical component in big data analytics on distributed systems. It determines how data values in a two-dimensional table are organized and stored in the underlying cluster. Based on Hadoop computing environments, several table placement methods have been proposed and implemented. However, a...

    Provided By VLD Digital

  • White Papers // Aug 2013

    Adaptive Range Filters for Cold Data: Avoiding Trips to Siberia

    Bloom filters are a great technique to test whether a key is not in a set of keys. This paper presents a novel data structure called ARF. In a nutshell, ARFs are for range queries what Bloom filters are for point queries. That is, an ARF can determine whether a...

    Provided By VLD Digital

  • White Papers // Aug 2013

    Scaling Queries Over Big RDF Graphs With Semantic Hash Partitioning

    Massive volumes of big RDF data are growing beyond the performance capacity of conventional RDF data management systems operating on a single node. Applications using large RDF data demand efficient data partitioning solutions for supporting RDF data access on a cluster of compute nodes. In this paper the authors present...

    Provided By VLD Digital

  • White Papers // Aug 2013

    An Experimental Analysis of Iterated Spatial Joins in Main Memory

    Many modern applications rely on high-performance processing of spatial data. Examples include location-based services, games, virtual worlds, and scientific simulations such as molecular dynamics and behavioral simulations. These applications deal with large numbers of moving objects that continuously sense their environment, and their data access can often be abstracted as...

    Provided By VLD Digital

  • White Papers // Aug 2013

    Distributed SociaLite: A Datalog-Based Language for Large-Scale Graph Analysis

    Large-scale graph analysis is becoming important with the rise of world-wide social network services. Recently in SociaLite, the authors proposed extensions to Datalog to efficiently and succinctly implement graph analysis programs on sequential machines. This paper describes novel extensions and optimizations of SociaLite for parallel and distributed executions to support...

    Provided By VLD Digital

  • White Papers // Aug 2013

    Query-Driven Approach to Entity Resolution

    The significance of data quality research is motivated by the observation that the effectiveness of data-driven technologies such as decision support tools, data exploration, analysis, and scientific discovery tools is closely tied to the quality of data to which such techniques are applied. This paper explores "on-the-fly" data cleaning in...

    Provided By VLD Digital

  • White Papers // Aug 2013

    Efficient Bulk Updates on Multiversion B-trees

    Partially persistent index structures support efficient access to current and past versions of objects, while updates are allowed on the current version. The MultiVersion B-Tree (MVBT) represents a partially persistent index structure with both asymptotic worst-case performance and excellent performance in real-life applications. Updates are performed tuple-by-tuple with the same...

    Provided By VLD Digital

  • White Papers // Aug 2013

    Counting and Sampling Triangles from a Graph Stream

    Triangle counting has emerged as an important building block in the study of social networks, identifying thematic structures of networks, spam and fraud detection, link classification and recommendation, and more. This paper presents a new space-efficient algorithm for counting and sampling triangles - and more generally, constant-sized cliques - in...

    Provided By VLD Digital

  • White Papers // Aug 2013

    Expressiveness and Complexity of Order Dependencies

    Dependencies play an important role in databases. The authors study Order Dependencies (ODs) - and Unidirectional Order Dependencies (UODs), a proper sub-class of ODs - which describe the relationships among lexicographical orderings of sets of tuples. They consider lexicographical ordering, as by the order-by operator in SQL, because this is...

    Provided By VLD Digital

  • White Papers // Sep 2013

    Horton+: A Distributed System for Processing Declarative Reachability Queries over Partitioned Graphs

    Horton+ is a graph query processing system that executes declarative reachability queries on a partitioned attributed multi-graph. It employs a query language, query optimizer, and a distributed execution engine. The query language expresses declarative reachability queries, and supports closures and predicates on node and edge attributes to match graph paths....

    Provided By VLD Digital

  • White Papers // Aug 2013

    Discovering Denial Constraints

    Integrity Constraints (ICs) provide a valuable tool for enforcing correct application semantics. However, designing ICs requires experts and time. Proposals for automatic discovery have been made for some formalisms, such as functional dependencies and their extension, conditional functional dependencies. Unfortunately, these dependencies cannot express many common business rules. For example,...

    Provided By VLD Digital

  • White Papers // Aug 2013

    Multi-Tuple Deletion Propagation: Approximations and Complexity

    In this paper the authors study the computational complexity of the classic problem of deletion propagation in a relational database, where tuples are deleted from the base relations in order to realize a desired deletion of tuples from the view. Such an operation may result in a (sometimes unavoidable) side...

    Provided By VLD Digital

  • White Papers // Aug 2013

    Supporting Distributed Feed-Following Apps over Edge Devices

    In feed-following applications such as Twitter and Facebook, users (consumers) follow a large number of other users (producers) to get personalized feeds, generated by blending producers' feeds. With the proliferation of cloud-connected smart edge devices such as smartphones, producers and consumers of many feed-following applications reside on edge devices and...

    Provided By VLD Digital

  • White Papers // Aug 2013

    Scalable Column Concept Determination for Web Tables Using Large Knowledge Bases

    Tabular data on the web has become a rich source of structured data that is useful for ordinary users to explore. Due to its potential, tables on the web have recently attracted a number of studies with the goals of understanding the semantics of those web tables and providing effective...

    Provided By VLD Digital

  • White Papers // Aug 2013

    Universal Indexing of Arbitrary Similarity Models

    The increasing amount of available unstructured content, together with the growing number of large non-relational databases, puts more emphasis on content-based retrieval and, more precisely, on the area of similarity searching. Although there exist several indexing methods for efficient querying, not all of them are best-suited for arbitrary similarity models....

    Provided By VLD Digital

  • White Papers // Aug 2013

    Summarizing Answer Graphs Induced by Keyword Queries

    Various methods were developed to process keyword queries. In practice, these methods typically generate a set of graphs G induced by Q. Keyword search has been popularly used to query graph data. Due to the lack of structure support, a keyword query might generate an excessive number of matches, referred...

    Provided By VLD Digital

  • White Papers // Aug 2013

    A Probabilistic Optimization Framework for the Empty-Answer Problem

    The authors propose a principled optimization-based interactive query relaxation framework for queries that return no answers. Given an initial query that returns an empty answer set, their framework dynamically computes and suggests alternative queries with fewer conditions than those the user has initially requested, in order to help the user...

    Provided By VLD Digital

  • White Papers // Aug 2013

    Counter Strike: Generic Top-Down Join Enumeration for Hypergraphs

    Finding the optimal execution order of join operations is a crucial task of today's cost-based query optimizers. There are two approaches to identify the best plan: bottom-up and top-down join enumeration. But only the top-down approach allows for branch-and-bound pruning, which can improve compile time by several orders of...

    Provided By VLD Digital

  • White Papers // Aug 2013

    A Sampling Algebra for Aggregate Estimation

    As of 2005, sampling has been incorporated in all major database systems. While efficient sampling techniques are realizable, determining the accuracy of an estimate obtained from the sample is still an unresolved problem. In this paper, the authors present a theoretical framework that allows an elegant treatment of the problem....

    Provided By VLD Digital

  • White Papers // Aug 2013

    QuEval: Beyond High-Dimensional Indexing a La Carte

    In the recent past, the amount of high-dimensional data, such as feature vectors extracted from multimedia data, increased dramatically. A large variety of indexes have been proposed to store and access such data efficiently. However, due to specific requirements of a certain use case, choosing an adequate index structure is...

    Provided By VLD Digital

  • White Papers // Jan 2014

    Shared Workload Optimization

    As a result of increases in both the query load and the data managed, as well as changes in hardware architecture (multi-core), the last years have seen a shift from query-at-a-time approaches towards Shared Work (SW) systems where queries are executed in groups. Such groups share operators like scans and...

    Provided By VLD Digital

  • White Papers // Jan 2014

    Support the Data Enthusiast: Challenges for Next-Generation Data-Analysis Systems

    The authors present a vision of next-generation visual analytics ser-vices. They argue that these services should have three related capabilities: support visual and interactive data exploration as they do today, but also suggest relevant data to enrich visualizations, and facilitate the integration and cleaning of that data. Most importantly, they...

    Provided By VLD Digital

  • White Papers // Jan 2014

    A Provenance Framework for Data-Dependent Process Analysis

    A Data-Dependent Process (DDP) models an application who-se control flow is guided by a finite state machine, as well as by the state of an underlying database. DDPs are commonly found e.g., in e-commerce. In this paper, the authors develop a framework supporting the use of provenance in static (temporal)...

    Provided By VLD Digital

  • White Papers // Feb 2014

    Rank Join Queries in NoSQL Databases

    Cloud stores have become the storage of choice for a large variety of big data producers, consumers, and managers (e.g., Twitter, Facebook, Google, Amazon, etc.) For many modern Big Data applications, RDBMSs were found lacking, particularly with respect to scalability (in terms of number of data items, users, operations per...

    Provided By VLD Digital

  • White Papers // Jan 2014

    Tracking Entities in the Dynamic World: A Fast Algorithm for Matching Temporal Records

    Identifying records referring to the same real world entity over time enables longitudinal data analysis. However, difficulties arise from the dynamic nature of the world: the entities described by a temporal data set often evolve their states over time. While the state of the art approach to temporal entity matching...

    Provided By VLD Digital

  • White Papers // Feb 2014

    Lightweight Indexing of Observational Data in Log-Structured Storage

    Huge amounts of data are being generated by sensing devices every day, recording the status of objects and the environment. Such observational data is widely used in scientific research. As the capabilities of sensors keep improving, the data produced are drastically expanding in precision and quantity, making it a write-intensive...

    Provided By VLD Digital

  • White Papers // Feb 2014

    GRAMI: Frequent Subgraph and Pattern Mining in a Single Large Graph

    Mining frequent subgraphs is an important operation on graphs; it is defined as finding all subgraphs that appear frequently in a database according to a given frequency threshold. Most existing work assumes a database of many small graphs, but modern applications, such as social networks, citation graphs, or protein-protein interactions...

    Provided By VLD Digital

  • White Papers // Feb 2014

    Hybrid Parallelization Strategies for Large-Scale Machine Learning in SystemML

    Large-scale data analytics have become an integral part of online services, enterprise data management, system management, and scientific applications in order to gain value from huge amounts of collected data. Finding interesting unknown facts and patterns often requires analyzing the full data set instead of applying sampling techniques. Recent approaches...

    Provided By VLD Digital

  • White Papers // Feb 2014

    epiC: an Extensible and Scalable System for Processing Big Data

    The big data problem is characterized by the so-called 3V features: Volume - a huge amount of data, Velocity - a high data ingestion rate, and Variety - a mix of structured data, semi-structured data, and unstructured data. The state-of-the-art solutions to the big data problem are largely based...

    Provided By VLD Digital

  • White Papers // Feb 2014

    Optimizing Graph Algorithms on Pregel-Like Systems

    The authors study the problem of implementing graph algorithms efficiently on Pregel-like systems, which can be surprisingly challenging. Standard graph algorithms in this setting can incur unnecessary inefficiencies such as slow convergence or high communication or computation cost, typically due to structural properties of the input graphs such as large...

    Provided By VLD Digital

  • White Papers // Feb 2014

    Schemaless and Structureless Graph Querying

    Querying complex graph databases such as knowledge graphs is a challenging task for non-professional users. Due to their complex schemas and variational information descriptions, it becomes very hard for users to formulate a query that can be properly processed by the existing systems. The authors argue that for a user-friendly...

    Provided By VLD Digital

  • White Papers // Feb 2014

    Toward Computational Fact-Checking

    In this paper, the authors have shown how to turn fact-checking into a computational problem. Interestingly, by regarding claims as queries with parameters, they can check them - not just for correctness, but more importantly, for more subtle measures of quality - by perturbing their parameters. This observation leads the...

    Provided By VLD Digital

  • White Papers // Aug 2013

    Finding Shortest Paths on Terrains by Killing Two Birds with One Stone

    With the increasing availability of terrain data, e.g., from aerial laser scans, the management of such data is attracting increasing attention in both industry and academia. In particular, spatial queries, e.g., k-nearest neighbor and reverse nearest neighbor queries, in Euclidean and spatial network spaces are being extended to terrains. Such...

    Provided By VLD Digital

  • White Papers // Aug 2013

    Incremental and Accuracy-Aware Personalized PageRank Through Scheduled Approximation

    As the Personalized PageRank Vector (PPV) has been widely leveraged for ranking on a graph, efficient PPV computation has become a prominent issue. In this paper, the authors propose FastPPV, an approximate PPV computation algorithm that is incremental and accuracy-aware. Their approach hinges on a novel paradigm...

    Provided By VLD Digital