An information filtering system is a system that removes redundant or unwanted information from an information stream using (semi-)automated or computerized methods prior to presentation to a human user. Its main goal is the management of information overload and the increase of the semantic signal-to-noise ratio. To do this, the user's profile is compared to some reference characteristics. These characteristics may originate from the information item (the content-based approach) or from the user's social environment (the collaborative filtering approach).
Whereas in information transmission signal processing filters are used against syntax-disrupting noise at the bit level, the methods employed in information filtering act on the semantic level.
The range of machine methods employed builds on the same principles as those for information extraction. A notable application can be found in the field of email spam filters. Thus, it is not only the information explosion that necessitates some form of filtering, but also inadvertently or maliciously introduced pseudo-information.
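As a concrete illustration of such a filter, the sketch below shows a minimal naive Bayes spam classifier, one common technique for this task. The toy training messages, the whitespace tokenization, and the use of Laplace smoothing are assumptions for the example, not a description of any particular production system.

```python
import math
from collections import Counter

def train(messages):
    """Count word frequencies per class from (text, is_spam) pairs."""
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in messages:
        for word in text.lower().split():
            counts[is_spam][word] += 1
        totals[is_spam] += 1
    return counts, totals

def classify(text, counts, totals):
    """Return True if the message scores as spam under naive Bayes."""
    vocab = set(counts[True]) | set(counts[False])
    scores = {}
    for label in (True, False):
        # Log-prior for the class, plus log-likelihood of each word.
        score = math.log(totals[label] / sum(totals.values()))
        n = sum(counts[label].values())
        for word in text.lower().split():
            # Laplace smoothing avoids zero probability for unseen words.
            score += math.log((counts[label][word] + 1) / (n + len(vocab)))
        scores[label] = score
    return scores[True] > scores[False]

training = [
    ("win money now", True),
    ("free prize claim now", True),
    ("meeting agenda for tomorrow", False),
    ("project status report", False),
]
counts, totals = train(training)
print(classify("claim your free money", counts, totals))        # True
print(classify("status of the project meeting", counts, totals))  # False
```

Note that the decision acts on word statistics (a proxy for meaning) rather than on bit-level noise, which is what distinguishes this kind of filter from a signal-processing one.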
On the presentation level, information filtering takes the form of user-preference-based newsfeeds, etc.
Recommender systems and content discovery platforms are active information filtering systems that attempt to present to the user information items (films, television, music, books, news, web pages) the user is interested in. These systems add information items to the flow of information towards the user, as opposed to removing items from it. Recommender systems typically use collaborative filtering or a combination of collaborative and content-based filtering, although purely content-based recommender systems also exist.
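The collaborative approach mentioned above can be sketched in a few lines: score unrated items for a user by weighting other users' ratings with user-to-user similarity. The toy rating matrix and the choice of cosine similarity over co-rated items are assumptions for illustration only.

```python
import math

# Toy user-item ratings (1-5); missing keys mean the item is unrated.
ratings = {
    "alice": {"film_a": 5, "film_b": 4, "film_c": 1},
    "bob":   {"film_a": 4, "film_b": 5, "film_d": 2},
    "carol": {"film_c": 5, "film_d": 4},
}

def cosine(u, v):
    """Cosine similarity computed over the items both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    num = sum(u[i] * v[i] for i in common)
    du = math.sqrt(sum(u[i] ** 2 for i in common))
    dv = math.sqrt(sum(v[i] ** 2 for i in common))
    return num / (du * dv)

def recommend(user, ratings):
    """Rank items the user has not rated, weighted by user similarity."""
    scores = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], theirs)
        for item, r in theirs.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("alice", ratings))   # ['film_d']
```

A content-based variant would instead compare item features (genre, keywords) against the user's profile; the collaborative version above needs no item metadata at all, only the behaviour of similar users.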
Before the advent of the Internet, there were already several methods of filtering information; for instance, governments may control and restrict the flow of information in a given country by means of formal or informal censorship.
Another example of information filtering is the work done by newspaper editors and journalists, who provide a service that selects the most valuable information for their clients: readers of books, magazines, and newspapers, radio listeners, and TV viewers. This filtering operation is also present in schools and universities, where information is selected on academic criteria for the customers of this service, the students. With the advent of the Internet, anyone can publish anything they wish at low cost. Because of this, the quantity of less useful information has increased considerably, and consequently useful information has become harder to find. This problem motivated work on information filtering systems that retrieve the information required for each specific topic easily and efficiently.
A filtering system of this style consists of several tools that help people find the most valuable information, so that the limited time one can dedicate to reading, listening, or viewing is directed to the most interesting and valuable documents. These filters are also used to organize and structure information in a correct and understandable way, and to group the messages arriving in a mailbox. Such filters are essential to the results returned by search engines on the Internet, and filtering functions improve continually, making the retrieval of web documents and messages ever more efficient.
One of the criteria used in this step is whether the knowledge is harmful or not, that is, whether the information improves or worsens understanding of a concept. In this case, the task of information filtering is to reduce or eliminate the harmful information.
A learning-based filtering system generally consists of three basic stages: gathering pre-filtered training data, constructing and adapting the filter, and evaluating its performance.
Currently the problem is not finding the best way to filter information, but designing systems that learn the information needs of users on their own, automating not only the filtering process itself but also the construction and adaptation of the filter. Fields such as statistics, machine learning, pattern recognition, and data mining form the basis for developing information filters that adapt based on experience. To carry out the learning process, part of the information has to be pre-filtered into positive and negative examples, called training data, which can be generated by experts or via feedback from ordinary users.
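The feedback-driven adaptation described above can be sketched with a perceptron-style filter whose word weights shift whenever a user corrects one of its decisions. The class name, the threshold at zero, and the toy feedback messages are assumptions made for this illustration.

```python
from collections import defaultdict

class AdaptiveFilter:
    """Filter whose per-word weights adapt from user feedback."""

    def __init__(self):
        self.weights = defaultdict(float)

    def relevant(self, text):
        # Positive total weight means "pass this item to the user".
        return sum(self.weights[w] for w in text.lower().split()) > 0

    def feedback(self, text, is_relevant):
        """Perceptron-style update: only adjust weights on a mistake."""
        if self.relevant(text) != is_relevant:
            step = 1.0 if is_relevant else -1.0
            for w in text.lower().split():
                self.weights[w] += step

# Pre-filtered positive and negative examples act as training data.
f = AdaptiveFilter()
for text, label in [
    ("python tutorial for beginners", True),
    ("cheap watches on sale", False),
    ("advanced python tips", True),
]:
    f.feedback(text, label)

print(f.relevant("new python release"))   # True
```

Because updates happen only on errors, the same loop serves both for expert-labelled training data and for ongoing feedback from ordinary users, which is exactly the adaptation the text describes.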
As data is entered, the system derives new rules; if this data is to generalize beyond the training set, we must evaluate the system and measure its ability to correctly predict the categories of new information. This step is simplified by setting aside part of the training data as a separate series called "test data", used to measure the error rate. As a general rule it is important to distinguish between types of errors (false positives and false negatives). For example, in the case of a content aggregator for children, letting through unsuitable information that shows violence or pornography is far more serious than mistakenly discarding some appropriate information. To lower error rates and give these systems learning capabilities similar to those of humans, we need systems that simulate human cognitive abilities, such as natural-language understanding, common-sense capture of meaning, and other forms of advanced processing of the semantics of information.
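The evaluation step above, including the unequal seriousness of the two error types, can be made concrete with a small sketch. The blocklist classifier, the toy test set, and the 5:1 cost weighting are all assumptions chosen to illustrate the idea, not values from the text.

```python
def evaluate(classifier, test_data):
    """Count false positives and false negatives on held-out test data.

    Here "positive" means the classifier judged the item safe to show,
    and the label records whether it really was safe.
    """
    fp = fn = 0
    for text, label in test_data:
        predicted = classifier(text)
        if predicted and not label:
            fp += 1   # unsafe content let through: the serious error
        elif label and not predicted:
            fn += 1   # harmless content wrongly blocked
    return fp, fn

# Hypothetical blocklist filter for a children's content aggregator.
BLOCKED = {"violence", "gore"}
safe = lambda text: not (set(text.lower().split()) & BLOCKED)

test_data = [
    ("cartoon violence compilation", False),  # unsafe, correctly blocked
    ("educational science video", True),      # safe, correctly shown
    ("slapstick fight scene", False),         # unsafe, slips past the list
    ("funny animal clips", True),             # safe, correctly shown
]
fp, fn = evaluate(safe, test_data)

# Weight the errors unequally, since showing unsafe content is worse
# than blocking something harmless (the 5:1 ratio is illustrative).
weighted_error = 5 * fp + 1 * fn
print(fp, fn, weighted_error)   # 1 0 5
```

Reporting a single accuracy figure would hide exactly the distinction the text insists on; separating and weighting the two error counts keeps the costly mistakes visible.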
Nowadays, there are numerous techniques for building information filters, some of which achieve error rates lower than 10% in various experiments.[citation needed] Among these techniques are decision trees, support vector machines, neural networks, Bayesian networks, linear discriminants, and logistic regression. At present, these techniques are used in many applications, not only in the web context but in fields as varied as speech recognition, the classification of astronomical data, and the evaluation of financial risk.