TY - GEN

T1 - Approximately detecting duplicates for probabilistic data streams over sliding windows

AU - Wang, Xiujun

AU - Shen, Hong

PY - 2010

Y1 - 2010

N2 - A probabilistic data stream S is defined as a sequence of uncertain tuples < ti, pi >, i = 1..∞, with the semantics that element ti occurs in the stream with probability pi ε (0, 1). Thus each distinct element t, which occurs in tuples of S, has an existential probability based on the tuples: < ti = t, pi >ε S. Existing duplicate detection methods for a traditional deterministic data stream can't maintain these existential probabilities for elements in S, which is important query information. In this paper, we present a novel data structure, Floating Counter Bloom Filter (FCBF), as an extension of CBF [1], which can maintain these existential probabilities effectively. Based on FCBF, we present an efficient algorithm to approximately detect duplicates for probabilistic data streams over sliding windows. Given a sliding window size W and floating counter number N, for any t which occurs in the past sliding window, our method outputs the accurate existential probability of t with probability 1-(1/2) ln(2)*N/W. Our experimental results on the synthetic data verify the effectiveness of our approach.

AB - A probabilistic data stream S is defined as a sequence of uncertain tuples < ti, pi >, i = 1..∞, with the semantics that element ti occurs in the stream with probability pi ε (0, 1). Thus each distinct element t, which occurs in tuples of S, has an existential probability based on the tuples: < ti = t, pi >ε S. Existing duplicate detection methods for a traditional deterministic data stream can't maintain these existential probabilities for elements in S, which is important query information. In this paper, we present a novel data structure, Floating Counter Bloom Filter (FCBF), as an extension of CBF [1], which can maintain these existential probabilities effectively. Based on FCBF, we present an efficient algorithm to approximately detect duplicates for probabilistic data streams over sliding windows. Given a sliding window size W and floating counter number N, for any t which occurs in the past sliding window, our method outputs the accurate existential probability of t with probability 1-(1/2) ln(2)*N/W. Our experimental results on the synthetic data verify the effectiveness of our approach.

KW - Counting bloom filter

KW - Duplicate detection

KW - False positive

KW - Floating counter bloom filter

KW - Probabilistic data stream

UR - http://www.scopus.com/inward/record.url?scp=79952561571&partnerID=8YFLogxK

U2 - 10.1109/PAAP.2010.16

DO - 10.1109/PAAP.2010.16

M3 - Conference contribution

AN - SCOPUS:79952561571

SN - 9780769543123

T3 - Proceedings - 3rd International Symposium on Parallel Architectures, Algorithms and Programming, PAAP 2010

SP - 263

EP - 268

BT - Proceedings - 3rd International Symposium on Parallel Architectures, Algorithms and Programming, PAAP 2010

T2 - 3rd International Symposium on Parallel Architectures, Algorithms and Programming, PAAP 2010

Y2 - 18 December 2010 through 20 December 2010

ER -