Institute of Electrical and Electronics Engineers
The advent of new information-sharing technologies has led to a scenario in which thousands of textual documents are published every day. Many of these documents contain confidential information, which motivates measures to hide sensitive data before publication; this is precisely the goal of document sanitization. Although methods to assist the sanitization process have been proposed, most of them focus on detecting specific types of sensitive entities in concrete domains, so they lack generality and require user supervision. Moreover, to hide sensitive terms, most approaches opt to remove them, a measure that hampers the utility of the sanitized document.