Text joins for data cleansing and integration in a...

G - Physics – 06 – F

Patent


Details

G06F 17/30 (2006.01) G06F 7/02 (2006.01)

Patent

CA 2464927

An organization's data records are often noisy because of transcription errors, incomplete information, and a lack of standard formats for textual data. A fundamental task during data cleansing and integration is matching strings, perhaps across multiple relations, that refer to the same entity (e.g., an organization name or address). Furthermore, it is desirable to perform this matching within an RDBMS, which is where the data is likely to reside. In this paper, we adapt the widely used and established cosine similarity metric from the information retrieval field to the relational database context in order to identify potential string matches across relations. We then use this similarity metric to characterize this key aspect of data cleansing and integration as a join between relations on textual attributes, where the similarity of matches exceeds a specified threshold. Computing an exact answer to the text join can be expensive. For query processing efficiency, we propose an approximate, sampling-based approach to the join problem that can be easily and efficiently executed in a standard, unmodified RDBMS. Therefore, the present invention includes a system for string matching across multiple relations in a relational database management system, comprising generating a set of strings from a set of characters, decomposing each string into a subset of tokens, establishing at least two relations within the strings, establishing a similarity threshold for the relations, sampling the at least two relations, correlating the relations for the similarity threshold, and returning all of the tokens that meet the criteria of the similarity threshold.
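The text join described in the abstract can be illustrated with a minimal Python sketch of the exact (unsampled) variant. Several details here are assumptions rather than anything the abstract specifies: the tokens are taken to be overlapping q-grams, the token weights are a common tf-idf form computed separately for each relation, and the names `qgrams`, `tfidf_vectors`, and `text_join` are hypothetical. The sampling-based approximation and the in-RDBMS execution are not shown.

```python
import math
from collections import Counter


def qgrams(s, q=3):
    """Decompose a string into overlapping q-grams (an assumed tokenization)."""
    padded = "#" * (q - 1) + s.lower() + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def tfidf_vectors(strings, q=3):
    """Return one unit-length tf-idf weight vector (a dict) per string.

    The weighting tf * log(1 + n / df) is an assumption; the abstract
    only states that cosine similarity is used.
    """
    token_lists = [qgrams(s, q) for s in strings]
    n = len(strings)
    # Document frequency: number of strings in this relation containing the token.
    df = Counter(tok for toks in token_lists for tok in set(toks))
    vectors = []
    for toks in token_lists:
        tf = Counter(toks)
        weights = {t: tf[t] * math.log(1 + n / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


def text_join(r1, r2, threshold, q=3):
    """Exact text join: all pairs whose cosine similarity exceeds the threshold."""
    v1 = tfidf_vectors(r1, q)
    v2 = tfidf_vectors(r2, q)
    matches = []
    for i, a in enumerate(v1):
        for j, b in enumerate(v2):
            # Vectors are unit length, so the dot product is the cosine similarity.
            sim = sum(w * b[t] for t, w in a.items() if t in b)
            if sim > threshold:
                matches.append((r1[i], r2[j], sim))
    return matches
```

A call such as `text_join(["AT&T Corp"], ["AT&T Corporation"], 0.3)` compares every pair of strings across the two relations, which is the expensive exact computation that the sampling-based approach is meant to avoid.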
