Data Management

Oceans of data are generated every day by businesses and enterprises, and all of it must be prioritized, analyzed, and safeguarded with the right architecture, tools, policies, and procedures. TechRepublic provides the resources you need.

  • White Papers // Sep 2011

    Efficient Rank Join with Aggregation Constraints

The authors show that aggregation constraints, which naturally arise in several applications, can enrich the semantics of rank join queries by allowing users to impose their application-specific preferences in a declarative way. By analyzing the properties of aggregation constraints, they develop efficient deterministic and probabilistic algorithms which can push the aggregation...

    Provided By VLD Digital

  • White Papers // Sep 2011

    Keyword Search on Form Results

    In recent years there has been a good deal of research in the area of keyword search on structured and semi-structured data. Most of this body of work has a significant limitation in the context of enterprise data since it ignores the application code that has often been carefully designed...

    Provided By VLD Digital

  • White Papers // Sep 2011

    Massive Scale-out of Expensive Continuous Queries

    Scalable execution of expensive continuous queries over massive data streams requires input streams to be split into parallel sub-streams. The query operators are continuously executed in parallel over these sub-streams. Stream splitting involves both partitioning and replication of incoming tuples, depending on how the continuous query is parallelized. The authors...

    Provided By VLD Digital
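The splitting step described in the teaser above (partition some tuples, replicate others) can be sketched in a few lines. This is a hypothetical illustration, not the authors' system; the `key` and `replicate` parameters are made-up names standing in for whatever the parallelized query dictates:

```python
import hashlib

def split_stream(tuples, n, key=lambda t: t[0], replicate=lambda t: False):
    """Route each tuple to one of n parallel sub-streams: hash-partition
    on a key, or replicate (broadcast) a tuple to every sub-stream when
    the parallelized query needs it at all partitions."""
    substreams = [[] for _ in range(n)]
    for t in tuples:
        if replicate(t):
            for s in substreams:          # broadcast to every sub-stream
                s.append(t)
        else:                             # hash-partition on the key
            h = int(hashlib.md5(str(key(t)).encode()).hexdigest(), 16)
            substreams[h % n].append(t)
    return substreams

streams = split_stream([("a", 1), ("b", 2), ("a", 3)], n=2)
```

With pure partitioning, tuples sharing a key always land in the same sub-stream, so a keyed operator can run independently per partition.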

  • White Papers // Sep 2011

    Optimizing Probabilistic Query Processing on Continuous Uncertain Data

    Uncertain data management is becoming increasingly important in many applications, in particular, in scientific databases and data stream systems. Uncertain data in these new environments is naturally modeled by continuous random variables. An important class of queries uses complex selection and joins predicates and requires query answers to be returned...

    Provided By VLD Digital

  • White Papers // Sep 2011

    Stratification Criteria and Rewriting Techniques for Checking Chase Termination

The chase is a fixpoint algorithm enforcing satisfaction of data dependencies in databases. Its execution involves the insertion of tuples with possible null values and the changing of null values, which can be made equal to constants or other null values. Since the chase fixpoint evaluation could be...

    Provided By VLD Digital

  • White Papers // Sep 2011

    Private Analysis of Graph Structure

    The authors present efficient algorithms for releasing useful statistics about graph data while providing rigorous privacy guarantees. Their algorithms work on data sets that consist of relationships between individuals, such as social ties or email communication. The algorithms satisfy edge differential privacy, which essentially requires that the presence or absence...

    Provided By VLD Digital
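The edge differential privacy notion mentioned above can be illustrated with the standard Laplace mechanism on a sensitivity-1 statistic such as the edge count. This is a generic sketch of the privacy model, not the authors' algorithms:

```python
import math
import random

def laplace_noise(scale, rng):
    # Inverse-CDF sampling from the Laplace(0, scale) distribution.
    u = rng.random() - 0.5
    return -scale * (1 if u >= 0 else -1) * math.log(1 - 2 * abs(u))

def private_edge_count(edges, epsilon, rng=random.Random(0)):
    """Release |E| under edge differential privacy: adding or removing
    one edge changes the count by exactly 1 (sensitivity 1), so
    Laplace(1/epsilon) noise suffices for epsilon-edge-DP."""
    return len(edges) + laplace_noise(1.0 / epsilon, rng)

noisy = private_edge_count({(1, 2), (2, 3), (1, 3)}, epsilon=1.0)
```

Richer statistics (degree distributions, triangle counts) have larger sensitivity, which is where the paper's more careful algorithms come in.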

  • White Papers // Sep 2011

Profiling, What-if Analysis, and Cost-Based Optimization of MapReduce Programs

    MapReduce has emerged as a viable competitor to database systems in big data analytics. MapReduce programs are being written for a wide variety of application domains including business data processing, text analysis, natural language processing, Web graph and social network analysis, and computational science. However, MapReduce systems lack a feature...

    Provided By VLD Digital
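For readers unfamiliar with the model being profiled, MapReduce can be sketched with an in-memory word count. The names `map_phase`, `shuffle`, and `reduce_phase` are illustrative, not a Hadoop API:

```python
from collections import defaultdict

def map_phase(records, map_fn):
    # Apply the user's map function to each record, collecting (key, value) pairs.
    kvs = []
    for r in records:
        kvs.extend(map_fn(r))
    return kvs

def shuffle(kvs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for k, v in kvs:
        groups[k].append(v)
    return groups

def reduce_phase(groups, reduce_fn):
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

# Word count: map emits (word, 1), reduce sums the ones.
docs = ["big data analytics", "big graph analysis"]
counts = reduce_phase(
    shuffle(map_phase(docs, lambda line: [(w, 1) for w in line.split()])),
    lambda k, vs: sum(vs),
)
```

The paper's point is that real MapReduce systems give the programmer no cost-based feedback about such jobs, which is what its profiler and what-if engine supply.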

  • White Papers // Sep 2011

    Randomized Generalization for Aggregate Suppression Over Hidden Web Databases

    Many web databases are hidden behind restrictive form-like interfaces which allow users to execute search queries over the underlying hidden database. While it is important to support such search queries, many hidden database owners also want to maintain a certain level of privacy for aggregate information over their databases, for...

    Provided By VLD Digital

  • White Papers // Sep 2011

    Publishing Set-Valued Data via Differential Privacy

    Set-valued data provides enormous opportunities for various data mining tasks. In this paper, the authors study the problem of publishing set-valued data for data mining tasks under the rigorous differential privacy model. All existing data publishing methods for set-valued data are based on partition-based privacy models, for example k-anonymity, which...

    Provided By VLD Digital

  • White Papers // Sep 2011

    Storing Matrices on Disk: Theory and Practice Revisited

    The authors consider the problem of storing arrays on disk to support scalable data analysis involving linear algebra. They propose Linearized Array B-tree, or LAB-tree, which supports flexible array layouts and automatically adapts to varying sparsity across parts of an array and over time. They reexamine the B-tree splitting strategy...

    Provided By VLD Digital
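One common way to linearize a 2-D array into one-dimensional keys, the first step behind B-tree-based array stores like the LAB-tree described above, is Z-order (Morton) bit interleaving. This is a generic illustration of linearization, not the paper's exact layout:

```python
def morton_key(i, j, bits=16):
    """Interleave the bits of (i, j) into one integer key (Z-order),
    a linearization that keeps nearby array cells nearby on disk."""
    key = 0
    for b in range(bits):
        key |= ((i >> b) & 1) << (2 * b + 1)
        key |= ((j >> b) & 1) << (2 * b)
    return key

def linearize(matrix_cells):
    # Store only nonzero cells of a sparse matrix under linearized keys,
    # sorted so a B-tree-style structure can index them directly.
    return dict(sorted((morton_key(i, j), v) for (i, j), v in matrix_cells.items()))

store = linearize({(0, 0): 1.0, (1, 1): 2.0, (0, 1): 3.0})
```

Storing only present cells under sorted keys is what lets such a structure adapt to varying sparsity across parts of the array.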

  • White Papers // Sep 2011

    Queries with Difference on Probabilistic Databases

    The authors study the feasibility of the exact and approximate computation of the probability of relational queries with difference on tuple-independent databases. They show that even the difference between two "Safe" conjunctive queries without self-joins is "Unsafe" for exact computation. They turn to approximation and design an FPRAS for a...

    Provided By VLD Digital
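A Monte Carlo estimator in the spirit of an FPRAS, applied to a tuple-independent database, can be sketched as follows. This is a simplified illustration of the approximation idea, not the authors' construction:

```python
import random

def estimate_probability(tuples, query, samples=20000, rng=random.Random(42)):
    """tuples: list of (value, probability) with independent tuples.
    query: boolean predicate over the set of present values.
    Returns the fraction of sampled possible worlds where the query
    holds -- a Monte Carlo estimate of its probability."""
    hits = 0
    for _ in range(samples):
        world = {v for v, p in tuples if rng.random() < p}
        if query(world):
            hits += 1
    return hits / samples

# Difference of two queries: P(a present AND b absent) with independent
# tuples a(0.5) and b(0.5) is exactly 0.25.
est = estimate_probability([("a", 0.5), ("b", 0.5)],
                           lambda w: "a" in w and "b" not in w)
```

An FPRAS additionally guarantees the sample count needed for a given relative error bound; naive world sampling gives no such guarantee when the true probability is tiny.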

  • White Papers // Sep 2011

    Where in the World is My Data?

Users of websites such as Facebook, eBay and Yahoo! demand fast response times, and these sites replicate data across globally distributed datacenters to achieve this. However, it is not necessary to replicate all data to all locations: if a European user's record is never accessed in Asia, it does not...

    Provided By VLD Digital

  • White Papers // Sep 2011

    Optimizing and Parallelizing Ranked Enumeration

    Lawler-Murty's procedure is a general tool for designing algorithms for enumeration problems (i.e., problems that involve the production of a large set of answers in ranked order), which naturally arise in database management. Lawler-Murty's procedure is used in a variety of modern database applications; particularly in those related to keyword...

    Provided By VLD Digital

  • White Papers // Sep 2011

    OXPath: A Language for Scalable, Memory-Efficient Data Extraction from Web Applications

    The evolution of the web has outpaced itself: the growing wealth of information and the increasing sophistication of interfaces necessitate automated processing. Web automation and extraction technologies have been overwhelmed by this very growth. To address this trend, the authors identify four key requirements of web extraction: interact with sophisticated...

    Provided By VLD Digital

  • White Papers // Sep 2011

    Linking Temporal Records

    Many data sets contain temporal records over a long period of time; each record is associated with a time stamp and describes some aspects of a real-world entity at that particular time (e.g., author information in DBLP). In such cases, the authors often wish to identify records that describe the...

    Provided By VLD Digital

  • White Papers // Sep 2011

    Optimistic Concurrency Control by Melding Trees

    In this paper, the authors describe a new optimistic concurrency control algorithm for tree-structured data called meld. Each transaction executes on a snapshot of a multi-version database and logs a record with its intended updates. Meld processes log records in log order on a cached partial-copy of the last committed...

    Provided By VLD Digital

  • White Papers // Sep 2011

    Business Policy Modeling and Enforcement in Databases

    Database systems are the central information repositories for businesses and are subject to a wide array of policies, rules and requirements. The spectrum of business level constraints implemented within database systems has expanded from classical access control to include auditing, usage control, privacy management, and records retention. The lack of...

    Provided By VLD Digital

  • White Papers // Sep 2011

    Dissemination of Models over Time-Varying Data

Dissemination of time-varying data is essential in many applications, such as sensor networks, patient monitoring, stock tickers, etc. Often, the raw data have to go through some form of pre-processing, such as cleaning, smoothing, etc., before being disseminated. Such pre-processing often applies mathematical or statistical models to transform the large...

    Provided By VLD Digital

  • White Papers // Sep 2011

    Lightweight Graphical Models for Selectivity Estimation Without Independence Assumptions

    As a result of decades of research and industrial development, modern query optimizers are complex software artifacts. However, the quality of the query plan chosen by an optimizer is largely determined by the quality of the underlying statistical summaries. Small selectivity estimation errors, propagated exponentially, can lead to severely sub-optimal...

    Provided By VLD Digital
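The cost of the independence assumption this paper targets is easy to demonstrate: on correlated columns, multiplying per-column selectivities can be badly wrong, while the joint distribution (which a graphical model approximates compactly) is accurate. A small made-up example:

```python
from collections import Counter

# Fabricated, strongly correlated data: trucks are mostly diesel.
rows = [("sedan", "gasoline")] * 45 + [("sedan", "diesel")] * 5 \
     + [("truck", "diesel")] * 45 + [("truck", "gasoline")] * 5

def sel_independent(rows, a, b):
    # Assume the columns are independent: multiply per-column selectivities.
    n = len(rows)
    pa = sum(1 for r in rows if r[0] == a) / n
    pb = sum(1 for r in rows if r[1] == b) / n
    return pa * pb

def sel_joint(rows, a, b):
    # Use the joint distribution directly.
    return Counter(rows)[(a, b)] / len(rows)

indep = sel_independent(rows, "truck", "diesel")   # 0.5 * 0.5 = 0.25
joint = sel_joint(rows, "truck", "diesel")         # 45/100 = 0.45
```

Here the independence estimate is off by nearly a factor of two on a single predicate pair; the paper's point is that such errors compound multiplicatively across a plan.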

  • White Papers // Aug 2012

Massively Parallel Sort-Merge Joins in Main Memory Multi-Core Database Systems

    Two emerging hardware trends will dominate the database system technology in the near future: increasing main memory capacities of several TB per server and massively parallel multi-core processing. Many algorithmic and control techniques in current database technology were devised for disk-based systems where I/O dominated the performance. In this paper,...

    Provided By VLD Digital
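The classic single-threaded sort-merge join underlying the paper's parallel variant can be sketched as follows. The paper's contribution is the massively parallel, hardware-conscious version, not this textbook form:

```python
def sort_merge_join(left, right, key=lambda t: t[0]):
    """Classic sort-merge join: sort both inputs on the join key, then
    merge them, pairing runs of equal keys."""
    L = sorted(left, key=key)
    R = sorted(right, key=key)
    out, i, j = [], 0, 0
    while i < len(L) and j < len(R):
        kl, kr = key(L[i]), key(R[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # Collect the run of equal keys on each side,
            # then emit their cross product.
            i2 = i
            while i2 < len(L) and key(L[i2]) == kl:
                i2 += 1
            j2 = j
            while j2 < len(R) and key(R[j2]) == kl:
                j2 += 1
            out.extend((l, r) for l in L[i:i2] for r in R[j:j2])
            i, j = i2, j2
    return out

pairs = sort_merge_join([(1, "a"), (2, "b")], [(2, "x"), (2, "y"), (3, "z")])
```

In the parallel setting, the inputs are range-partitioned so each core sorts and merges a disjoint key range independently.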

  • White Papers // Sep 2011

    Data Coordination: Supporting Contingent Updates

    In many scenarios, a contingent data source may benefit by coordinating with external heterogeneous sources upon which it depends. The administrator of this contingent source needs to update it when changes are made to the external base sources. For example, when a building design is updated, the contractor's cost estimate...

    Provided By VLD Digital

  • White Papers // Sep 2011

    Structure-Aware Sampling: Flexible and Accurate Summarization

    In processing large quantities of data, a fundamental problem is to obtain a summary which supports approximate query answering. Random sampling yields flexible summaries which naturally support subset-sum queries with unbiased estimators and well understood confidence bounds. Classic sample-based summaries, however, are designed for arbitrary subset queries and are oblivious...

    Provided By VLD Digital
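The unbiased subset-sum estimation that random samples support can be sketched with a uniform sample in which each item carries the scale factor N/n, a Horvitz-Thompson-style estimator. This is a generic illustration, not the paper's structure-aware scheme:

```python
import random

def sample_summary(data, n, rng=random.Random(7)):
    """Uniform random sample of n items; each sampled item carries the
    scale factor N/n so that subset sums are estimated without bias."""
    N = len(data)
    return [(x, N / n) for x in rng.sample(data, n)]

def estimate_subset_sum(summary, predicate):
    # Sum the scaled sampled items that fall in the queried subset.
    return sum(x * w for x, w in summary if predicate(x))

data = list(range(100))            # true sum of the even values: 2450
summary = sample_summary(data, 50)
est = estimate_subset_sum(summary, lambda x: x % 2 == 0)
```

The estimator is unbiased for any subset query, which is exactly the flexibility (and the obliviousness to structure) that the paper sets out to improve on.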

  • White Papers // Sep 2011

    Serializable Snapshot Isolation for Replicated Databases in High-Update Scenarios

    Many proposals for managing replicated data use sites running the Snapshot Isolation (SI) concurrency control mechanism, and provide 1-copy SI or something similar, as the global isolation level. This allows good scalability, since only ww-conflicts need to be managed globally. However, 1-copy SI can lead to data corruption and violation...

    Provided By VLD Digital

  • White Papers // Sep 2011

    Approximate Substring Matching over Uncertain Strings

    Text data is prevalent in life. Some of this data is uncertain and is best modeled by probability distributions. Examples include biological sequence data and automatic ECG annotations, among others. Approximate substring matching over uncertain texts is largely an unexplored problem in data management. In this paper, the authors study...

    Provided By VLD Digital

  • White Papers // Sep 2011

    Completeness of Queries over Incomplete Databases

    Data completeness is an important aspect of data quality as in many scenarios it is crucial to guarantee completeness of query answers. The authors develop techniques to conclude the completeness of query answers from information about the completeness of parts of a generally incomplete database. In their framework, completeness of...

    Provided By VLD Digital

  • White Papers // Sep 2011

    On Querying Historical Evolving Graph Sequences

    In many applications, information is best represented as graphs. In a dynamic world, information changes and so the graphs representing the information evolve with time. The authors propose that historical graph-structured data be maintained for analytical processing. They call a historical evolving graph sequence an EGS. They observe that in...

    Provided By VLD Digital

  • White Papers // Sep 2011

    On Link-based Similarity Join

    Graphs can be found in applications like social networks, bibliographic networks, and biological databases. Understanding the relationship, or links, among graph nodes enables applications such as link prediction, recommendation, and spam detection. In this paper, the authors propose Link-based Similarity join (LS-join), which extends the similarity join operator to link-based...

    Provided By VLD Digital

  • White Papers // Sep 2011

    Entity Matching: How Similar Is Similar

Entity matching, which finds records referring to the same entity, is an important operation in data cleaning and integration. Existing studies usually use a given similarity function to quantify the similarity of records, and focus on devising index structures and algorithms for efficient entity matching. However, it is a big...

    Provided By VLD Digital

  • White Papers // Sep 2011

    PLP: Page Latch-Free Shared-Everything OLTP

    Scaling the performance of shared-everything transaction processing systems to highly-parallel multicore hardware remains a challenge for database system designers. Recent proposals alleviate locking and logging bottlenecks in the system, leaving page latching as the next potential problem. To tackle the page latching problem, the authors propose PhysioLogical Partitioning (PLP). The...

    Provided By VLD Digital

  • White Papers // Sep 2011

    On Pruning for Top-K Ranking in Uncertain Databases

Top-k ranking for an uncertain database orders its tuples so that the best k of them can be determined. The problem has been formalized under the unified approach based on Parameterized Ranking Functions (PRFs) and the possible-world semantics. Given a PRF, one can always compute the...

    Provided By VLD Digital

  • White Papers // Sep 2011

    Merging What's Cracked, Cracking What's Merged: Adaptive Indexing in Main-Memory Column-Stores

    Adaptive indexing is characterized by the partial creation and refinement of the index as side effects of query execution. Dynamic or shifting workloads may benefit from preliminary index structures focused on the columns and specific key ranges actually queried - without incurring the cost of full index construction. The costs...

    Provided By VLD Digital

  • White Papers // Sep 2011

    CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop

    Hadoop has become an attractive platform for large-scale data analytics. In this paper, the authors identify a major performance bottleneck of Hadoop: its lack of ability to colocate related data on the same set of nodes. To overcome this bottleneck, they introduce CoHadoop, a lightweight extension of Hadoop that allows...

    Provided By VLD Digital

  • White Papers // Sep 2011

    Distance-Constraint Reachability Computation in Uncertain Graphs

    Driven by the emerging network applications, querying and mining uncertain graphs has become increasingly important. In this paper, the authors investigate a fundamental problem concerning uncertain graphs, which they call the Distance-Constraint Reachability (DCR) problem: given two vertices s and t, what is the probability that the distance from s...

    Provided By VLD Digital
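A naive baseline for the distance-constraint reachability problem described above is Monte Carlo sampling of possible graphs: materialize each edge with its probability, then check whether s reaches t within d hops. This is a generic illustration; the paper develops far more efficient estimators:

```python
import random
from collections import deque

def dcr_probability(edges, s, t, d, samples=5000, rng=random.Random(1)):
    """Monte Carlo estimate of distance-constraint reachability: the
    probability that dist(s, t) <= d when each edge (u, v, p) exists
    independently with probability p. Edges are undirected, unit length."""
    hits = 0
    for _ in range(samples):
        adj = {}
        for u, v, p in edges:
            if rng.random() < p:          # sample this edge's existence
                adj.setdefault(u, []).append(v)
                adj.setdefault(v, []).append(u)
        dist = {s: 0}                     # BFS, truncated at depth d
        q = deque([s])
        while q:
            u = q.popleft()
            if dist[u] == d:
                continue
            for v in adj.get(u, []):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        if dist.get(t, d + 1) <= d:
            hits += 1
    return hits / samples

# Two parallel unit-length edges, each present w.p. 0.5: P(dist <= 1) = 0.75.
p = dcr_probability([("s", "t", 0.5), ("s", "t", 0.5)], "s", "t", 1)
```

The sample count needed for a tight estimate grows quickly when the true probability is small, which motivates the paper's variance-reduction techniques.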

  • White Papers // Sep 2011

    Efficiently Compiling Efficient Query Plans for Modern Hardware

    As main memory grows, query performance is more and more determined by the raw CPU costs of query processing itself. The classical iterator style query processing technique is very simple and flexible, but shows poor performance on modern CPUs due to lack of locality and frequent instruction mispredictions. Several techniques...

    Provided By VLD Digital

  • White Papers // Sep 2011

    Recovering Semantics of Tables on the Web

    The Web offers a corpus of over 100 million tables, but the meaning of each table is rarely explicit from the table itself. Header rows exist in few cases and even when they do, the attribute names are typically useless. The authors describe a system that attempts to recover the...

    Provided By VLD Digital

  • White Papers // Sep 2011

    Surrogate Parenthood: Protected and Informative Graphs

    Many applications, including provenance and some analyses of social networks, require path-based queries over graph-structured data. When these graphs contain sensitive information, paths may be broken, resulting in uninformative query results. This paper presents innovative techniques that give users more informative graph query results; the techniques leverage a common industry...

    Provided By VLD Digital

  • White Papers // Sep 2011

    Guided Interaction: Rethinking the Query-Result Paradigm

    Many decades of research, coupled with continuous increases in computing power, have enabled highly efficient execution of queries on large databases. In consequence, for many databases, far more time is spent by users formulating queries than by the system evaluating them. It stands to reason that, looking at the overall...

    Provided By VLD Digital

  • White Papers // Sep 2011

    Resiliency-Aware Data Management

Computing architectures are shifting toward massively parallel environments with increasing numbers of heterogeneous components. The large scale, in combination with decreasing feature sizes, leads to dramatically increasing error rates, and the heterogeneity further leads to new error types. Techniques for ensuring resiliency in terms of robustness regarding these errors are typically applied...

    Provided By VLD Digital

  • White Papers // Sep 2011

    Microsoft Codename "Montego" Data Import, Transformation, and Publication for Information Workers

A fundamental problem in database systems is deriving useful information from the untold quantities of data fragments that exist in the web's data stores. Data is abundant; useful information is rare. This problem space plays host to many successful and innovative solutions from industry and the open-source community. Each solution has...

    Provided By VLD Digital

  • White Papers // Sep 2011

    AIDA: An Online Tool for Accurate Disambiguation of Named Entities in Text and Tables

    The authors present AIDA, a framework and online tool for entity detection and disambiguation. Given a natural-language text or a Web table, they map mentions of ambiguous names onto canonical entities like people or places, registered in a knowledge base like DBpedia, Freebase, or YAGO. AIDA is a robust framework...

    Provided By VLD Digital