3 scanmail, testscan \- spam filters
14 .I sender system rcpt-list
29 accepts a mail message supplied on standard input,
30 applies a file of patterns to a portion of it,
34 It exactly replaces the
35 generic queuing command
37 that is executed from the
41 in the mail processing pipeline.
42 Associated with each pattern is an
44 in order of decreasing priority:
48 the message is deleted and a log entry is written to
52 the message is placed in a queue for human inspection
55 a line containing the matching portion of the message is written to a log
58 If no pattern matches or only patterns with an action of
60 match, the message is accepted and
62 queues the message for delivery.
64 meshes with the blocking facilities
67 to provide several layers of
68 filtering on gateway systems. In all cases the sender
69 is notified that the message has been successfully
71 leaving the sender unaware that the message may have been delayed or deleted.
74 accepts the arguments of
76 as well as the following:
80 Save a copy of each message in a
81 randomly-named file in
86 Write debugging information to standard error.
91 messages by sending domain name.
94 option must specify a root directory; messages
95 are queued in subdirectories of this directory.
98 option is not specified,
99 messages are accumulated in a subdirectory of
101 named for the contents of
108 Messages are never held for inspection; they are always delivered. Also known as
109 .IR "vacation mode" .
112 Read the patterns from
115 .BR /mail/lib/patterns .
118 Queue deliverable messages in subdirectories of
120 This option is the same as the
124 and must be present if the
130 messages. Messages are stored, one per randomly-named file,
136 Test mode. The pattern matcher is applied but the message is
137 discarded and the result is not logged.
140 Print the highest priority match.
144 option for testing the pattern matcher without actually
149 is the command line version of
153 is missing, it applies the pattern set to
154 the message on standard input. Unlike
156 which finds the highest priority match,
158 prints all matches in the portion of the message under test.
159 It is useful for testing a pattern set or
160 implementing a personal filter
163 file in a user's mail directory.
165 accepts the following options:
168 Print matches in the complete input message
174 Print the message after conversion to canonical form
178 Read the patterns from
181 .BR /mail/lib/patterns .
183 Before pattern matching, both programs convert a portion of
184 the message header and the beginning of the
185 message to a canonical form. The amounts of the header
186 and message body processed are set by
187 compile-time parameters in the source files.
188 The canonicalization process converts letters to lower-case and
189 replaces consecutive spaces, tabs and newline characters
190 with a single space. HTML commands are
191 deleted except for the parameters following
199 directives. Additionally, the following MIME escape sequences
200 are replaced by their ASCII
215 assembles the sender, destination domain and recipient fields of
216 the command line into a string that is
217 subjected to the same canonical processing.
218 Following canonicalization, the command line and
219 the two long strings containing
220 the header and the message body are passed to the
221 matching engine for analysis.
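.PP
The two basic steps described above, case folding and white-space collapsing, can be sketched in Python as follows.
This is only an illustration, not the code the programs use: the name
canonicalize is hypothetical, and the HTML and MIME handling is deliberately left out.

```python
import re


def canonicalize(text: str) -> str:
    """Lower-case letters and collapse runs of spaces, tabs and newlines.

    Hypothetical helper: HTML stripping and MIME escape replacement,
    which the real canonicalization also performs, are not modelled.
    """
    lowered = text.lower()
    # Any run of spaces, tabs or newlines becomes one space.
    return re.sub(r"[ \t\n]+", " ", lowered)


if __name__ == "__main__":
    print(canonicalize("Subject:\tBUY   NOW\nLimited  Offer"))
```

Run as a script, the sketch prints "subject: buy now limited offer", the form in which a string pattern would then be compared.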
223 The matching engine compiles the pattern set
224 and matches it against each canonicalized input string.
225 Patterns are specified one per line
229 {*}\fIaction\fP: \fIpattern-spec\fP {~~\fIoverride\fP...~~\fIoverride\fP}
234 introduces a comment; there is no way to escape this character.
240 that is a string; otherwise, the
242 is a regular expression in the style of
244 Regular expression matching is many
245 times less efficient than string matching, so it is
246 wiser to enumerate several similar strings
247 than to combine them into a regular expression.
250 is a keyword terminated by a
252 and separated from the pattern by optional white-space.
253 It must be one of the following:
256 if the pattern matches, the message is deleted. If the
258 command line option is set, the message is saved.
261 if the pattern matches, the message is queued in a subdirectory
264 for manual inspection. After inspection, the queue can be swept
269 to deliver messages that were inadvertently matched.
272 this is the same as the
274 action, except the pattern is only applied to the message header.
275 This optimization is useful for patterns that match header fields
276 that are unlikely to be present in the body of the message.
279 the sender and a section of the message around the match are written to
282 The message is always delivered.
285 patterns of this type are applied only to the canonicalized command line.
286 When a match occurs, all patterns with
288 actions are disabled. This is useful for limiting
289 the size of the log file by excluding repetitive messages, such
290 as those from mailing lists.
292 Patterns are accumulated into pattern sets sharing the same action.
293 The matching engine applies the
295 pattern set first, then the
299 pattern sets, and finally the
301 pattern set. Each pattern set is applied three times:
302 to the canonicalized command line, to the message header, and
303 finally to the message body. The ordering of patterns
304 in the pattern file is insignificant.
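.PP
The application order described above can be sketched as a pair of nested loops.
The sketch is an assumption-laden illustration: the labels "delete", "hold" and "log" are descriptive
stand-ins rather than the action keywords, only string patterns are modelled, and
first_match is a hypothetical name.

```python
from typing import Dict, List, Optional, Tuple

# Descriptive stand-ins for the actions, highest priority first.
# These labels are assumptions, not the keywords used in the pattern file.
ACTION_ORDER = ["delete", "hold", "log"]

# Each pattern set is tried on the three canonicalized inputs in this order.
INPUT_ORDER = ["command line", "header", "body"]


def first_match(
    pattern_sets: Dict[str, List[str]],
    inputs: Dict[str, str],
) -> Optional[Tuple[str, str, str]]:
    """Return (action, input name, pattern) for the highest-priority match."""
    for action in ACTION_ORDER:
        for name in INPUT_ORDER:
            for pattern in pattern_sets.get(action, []):
                # String patterns only: a plain substring test.
                if pattern in inputs[name]:
                    return action, name, pattern
    return None  # nothing matched


if __name__ == "__main__":
    sets = {"log": ["offer"], "hold": ["free money"]}
    msg = {
        "command line": "a@b.example local c",
        "header": "subject: offer",
        "body": "free money inside",
    }
    print(first_match(sets, msg))
```

Because the outer loop walks the actions rather than the file, the position of a pattern in the pattern file has no effect on the result.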
308 is a string of characters terminated by a
311 or override indicator,
313 Trailing white-space is deleted but
314 patterns containing leading or trailing white-space can
315 be enclosed in double-quote
316 characters. A pattern containing a double-quote
317 must be enclosed in double-quote characters, and each
318 embedded double-quote must be preceded by a backslash.
319 For example, the pattern
322 "this is not \\"spam\\""
325 matches the string \fLthis is not "spam"\fP.
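.PP
The quoting rule can be illustrated with a small Python sketch.
It assumes the pattern text has already been separated from the action and from any overrides;
the name unquote is hypothetical and the real parser may differ in detail.

```python
def unquote(spec: str) -> str:
    """Return the literal text of a pattern specification.

    Hypothetical helper: a quoted pattern keeps its leading and trailing
    white-space and has backslash-escaped double-quotes restored; an
    unquoted pattern only loses its trailing white-space.
    """
    if len(spec) >= 2 and spec[0] == '"' and spec[-1] == '"':
        inner = spec[1:-1]
        return inner.replace('\\"', '"')
    return spec.rstrip()


if __name__ == "__main__":
    # The pattern from the example above, as it appears in the pattern file.
    print(unquote('"this is not \\"spam\\""'))
```

Run as a script, the sketch prints the string this is not "spam".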
328 is followed by zero or more
330 strings. When the specific pattern matches,
331 each override is applied and
332 if one matches, it cancels the effect of the pattern.
333 Overrides must be strings; regular expressions are not supported.
334 Each override is introduced by the string
336 and continues until a subsequent
341 white-space included.
344 immediately followed by a
346 indicates a line continuation and further overrides continue
347 on the following line.
349 on the continuation line is ignored. For example,
352 *hold: sex.com~~essex.com~~sussex.com~~sysex.com~~
353 lasex.com~~cse.psu.edu!owner-9fans
356 matches all input containing the string
358 except for messages that also contain the
359 strings in the override list. Often it
360 is desirable to override a pattern based on
361 the name of the sender or
362 recipient. For this reason, each override
363 pattern is applied to the header and the command line as well
364 as the section of the
365 canonicalized input containing the matching data.
366 Thus a pattern matching the command line or the header
367 searches both the command line and the header
368 for overrides while a match in the body searches
369 the body, header and command line for overrides.
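.PP
The override scope described above can be sketched as follows.
This is an illustration under stated assumptions: patterns and overrides are plain strings,
the three canonicalized inputs are held in a dictionary, and effective_match is a hypothetical name.

```python
from typing import Dict, List


def effective_match(
    pattern: str,
    overrides: List[str],
    where: str,
    inputs: Dict[str, str],
) -> bool:
    """Report whether pattern matches inputs[where] and no override cancels it."""
    if pattern not in inputs[where]:
        return False
    # The command line and the header are always searched for overrides;
    # the body is searched only when the match itself was in the body.
    scope = ["command line", "header"]
    if where == "body":
        scope.append("body")
    return not any(o in inputs[s] for o in overrides for s in scope)


if __name__ == "__main__":
    overrides = ["essex.com", "sussex.com"]
    msg = {
        "command line": "someone@example.org gw user",
        "header": "from: list@essex.com",
        "body": "visit sex.com today",
    }
    # The body matches, but the header contains an override string.
    print(effective_match("sex.com", overrides, "body", msg))
```

In this example the match in the body is cancelled because one of the override strings appears in the header, so the sketch prints False.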
371 The structure of the pattern file and the matching
372 algorithm define the strategy for detecting
373 and filtering unwanted messages. Ideally, a
375 pattern selects a message for inspection and, if the message
376 is determined to be undesirable, a specific
378 pattern is added to delete further instances
379 of the message. Additionally, it is often
380 useful to block the sender by updating the
384 In this regime, patterns with a
386 action generally match phrases
387 that are likely to be unique. Patterns that
388 hold a message for inspection
389 match phrases commonly found in undesirable material and
390 occasionally in legitimate messages. Patterns
391 that log matches are less specific yet. In all
392 cases the ability to override a pattern by
393 matching another string allows repetitive messages
394 that trigger the pattern, such as mailing lists,
395 to pass the filter after the first one is processed
398 option allows deleted messages to be salvaged
399 by either manual or semi-automatic review, supporting
400 the specification of more aggressive patterns.
401 Finally, the utility of the pattern matcher is not
402 confined to filtering spam; it is a generally useful
403 administrative tool for deleting inadvertently harmful
404 messages, for example, mail loops, stuck senders or viruses.
405 It is also useful for collecting or counting messages
406 matching certain criteria.
408 .TF /mail/queue.dump/*
410 .B /mail/lib/patterns
414 log of deleted messages
422 directories where legitimate messages are queued for delivery
425 directory where held messages are queued for inspection
427 .B /mail/queue.dump/*
430 messages are stored when the
432 command line option is specified.
435 directory where copies of all incoming messages
439 .B /sys/src/cmd/upas/scanmail
446 does not report a match when the body of a message
447 contains exactly one line.