In this work we concentrate on the categorization of relational
attributes based on their data type. Assuming that attribute
types and characteristics are unknown or unidentifiable, we analyze
and compare a variety of type-based signatures for classifying
the attributes based on the semantic type of the data contained
therein (e.g., router identifiers, social security numbers,
email addresses). The signatures can subsequently be used for
other applications as well, like clustering and index
optimization/compression. Such classification is useful when
very large data collections, generated in a
distributed, ungoverned fashion, end up with unknown,
incomplete, inconsistent, or very complex schemata and
schema-level metadata. We concentrate on heuristically generating
type-based attribute signatures based on both local and global
computation approaches. We show experimentally that by
decomposing data into q-grams and then considering signatures
based on q-gram distributions, we achieve very good
classification accuracy under the assumption that a large sample
of the data is available for building the signatures. Then, we
turn our attention to cases where a very small sample of the
data is available, and hence accurately capturing the q-gram
distribution of a given data type is almost impossible. We
propose techniques based on dimensionality reduction and
soft-clustering that exploit correlations between attributes to
improve classification accuracy.
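
As an illustration of the q-gram idea described above, the following is a minimal sketch, not the method evaluated in this work. It assumes a signature is the normalized q-gram frequency distribution of an attribute's values, and that an unlabeled attribute is assigned the label of the most similar known signature under cosine similarity. The function names (`qgram_signature`, `cosine`, `classify`), the padding characters, and the choice of cosine similarity are all hypothetical choices made for this example.

```python
from collections import Counter
import math


def qgram_signature(values, q=3):
    """Return the normalized q-gram distribution over an attribute's values.

    Each value is padded with '#' and '$' (an assumption of this sketch)
    so that prefixes and suffixes also contribute q-grams.
    """
    counts = Counter()
    for value in values:
        padded = "#" * (q - 1) + value + "$" * (q - 1)
        for i in range(len(padded) - q + 1):
            counts[padded[i:i + q]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def cosine(sig_a, sig_b):
    """Cosine similarity between two sparse q-gram distributions."""
    dot = sum(weight * sig_b.get(gram, 0.0) for gram, weight in sig_a.items())
    norm_a = math.sqrt(sum(w * w for w in sig_a.values()))
    norm_b = math.sqrt(sum(w * w for w in sig_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(values, labeled_signatures, q=3):
    """Pick the label whose signature is most similar to the sample's."""
    sample_sig = qgram_signature(values, q)
    return max(
        labeled_signatures,
        key=lambda label: cosine(sample_sig, labeled_signatures[label]),
    )


# Hypothetical training data: one signature per known semantic type.
signatures = {
    "email": qgram_signature(["alice@example.com", "bob@mail.org"]),
    "ssn": qgram_signature(["123-45-6789", "987-65-4321"]),
}
label = classify(["carol@example.org"], signatures)
```

With only a handful of values per attribute, as in this toy data, the estimated distributions are sparse and unreliable, which is the small-sample difficulty the abstract raises.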