This class searches text or HTML files for embedded URLs and returns them in
an array. CAUTION: If used as a web spider to collect links on sites other
than your own, use sleep()/usleep() to slow it down and avoid overloading
other servers! Development was done under Windows NT, but I see no reason why
it wouldn't work under Unix.
<?php
# NOTES:
#
# Beware of setting the $depth parameter too high; this can cause lengthy
# searches and return very large result arrays! (Once again, CAUTION: If used
# as a web spider to collect links on sites other than your own, MODIFY THIS
# CLASS to use sleep()/usleep() to slow it down and avoid overloading other
# servers!)
#
# My main use for this function is to periodically scan web forum postings and
# summarize any URL mentions buried deep in the message threads. On Unix, this
# can be set up as a cron process; under WinNT, the command-line "AT" command
# or WinAT can be used for scheduling.
#
# Please excuse the formatting; I collapsed all tabs to single spaces, so some
# things aren't aligned nicely.
#
# If you have any suggestions/comments, email me: sbedberg@ucdavis.edu
#
# AND NOW, THE ACTUAL CODE:
?>
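<?php
# EXAMPLE (not part of the original class):
#
# A minimal sketch of the throttling recommended in the NOTES above. The
# function name, the $fetch callback and the default delay are assumptions
# made for illustration only; adjust them to suit your own spider.
function example_throttled_fetch($urls, $fetch, $delay_usec = 500000) {
 $results = array();
 $first = true;
 foreach ($urls as $url) {
  if (!$first) {
   usleep($delay_usec); # pause between requests, but not before the first
  }
  $first = false;
  $results[$url] = call_user_func($fetch, $url);
 }
 return $results;
}
#
# Possible usage (fetches at most two pages per second):
# $pages = example_throttled_fetch($urls, 'file_get_contents', 500000);
?>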
<?php
#------------------------------------------------------------------------------
# CLASS:
# link_harvester
# Scans text or HTML documents for URLs, and returns a list.
#------------------------------------------------------------------------------
# Notes:
# If the given filename is an actual file or a URL, it will scan that file;
# if it is a directory, it will look for and scan ALL files in that
# directory.
#
# If scanning a text file, will only recognize URLs with the "schema://..."
# form; newlines, tabs, whitespace and some punctuation (',",<,>) terminate
# it. Any links that are followed (depth > 0) will be followed as HTML, not
# as text files. The doc title array will contain the first line of the
# file, up to 80 chars. Lastly, all links are assumed to be non-local (too
# hard to determine otherwise).
#
# If scanning HTML, the routine only looks at HREFs. Anything without an
# explicit scheme is - as is specified by the W3 standard - assumed to be
# a reference to a local page.
#
# Public Methods:
# harvest($filename) Wrapper for _harvest()
# $filename This is the initial filename/URL to start scanning
#
# Private Methods:
# _harvest($filename, $depth, $as_text)
# Actual search routine
# $filename As above.
# $depth Depth of documents to search (see public vars, below)
# $as_text Determines how to search $filename (see public vars)
#
# _is_schema($string) Returns TRUE if a valid schema is found at the head
# of string (whitespace characters not permitted).
#
#------------------------------------------------------------------------------
# (version 0.9) written by S. Edberg (sbedberg@ucdavis.edu), March 1998.
#
# Change History:
#------------------------------------------------------------------------------
# TODO?
# BUGFIX: If (depth > 0 and local searching enabled), the base URL needs to
# be prepended to the $url argument to _harvest($url, $depth-1); currently,
# it returns a fail-to-open error.
# Have optional list of URLs to include or exclude
# Duplicate URL removal option
# Add timeout options (for entire script? for attempts to open URLs found?)
# Allow wildcards in file specification
# Optionally traverse subdirectories if a directory name is specified
#------------------------------------------------------------------------------
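# EXAMPLE (not part of the original class):
#
# A rough sketch of the text-mode rules described in the notes above: a URL
# must have the "schema://..." form, and whitespace, ', ", < and > end it.
# Both function names and both regular expressions are assumptions made for
# illustration; they are not the patterns used by _is_schema() or _harvest().
#
# Returns TRUE if $string begins with "schema://" (no leading whitespace).
function example_has_schema($string) {
 return preg_match('/^[A-Za-z][A-Za-z0-9+.-]*:\/\//', $string) === 1;
}
# Returns an array of every "schema://..." URL found in $text.
function example_text_urls($text) {
 preg_match_all('/[A-Za-z][A-Za-z0-9+.-]*:\/\/[^\s\'"<>]+/', $text, $matches);
 return $matches[0];
}
#
# Possible usage:
# $found = example_text_urls('See http://example.com/a and "ftp://host/x" now');
# $found would then hold the two URLs, without the surrounding quotes.
#------------------------------------------------------------------------------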
class link_harvester {
# Public variables: Input
var $depth = 0; # Link depth to search (0= $filename only, 1=1 level deep, ...)
var $as_text = 1; # If 1 (TRUE) scans docs as text; 0 (FALSE) - assumes HTML
var $local_sw = 1; # If TRUE, collect local links; ignore local otherwise
var $remote_sw = 1; # If TRUE, collect nonlocal link info; ignore otherwise
# Public variables: Output
var $attempts = 0; # No. of files attempted
var $failures = 0; # No. of files unreachable
var $lasterror = ''; # Last error message generated (excluding unreachable files)
var $links; # Array of links harvested
var $titles; # Array of titles ($titles[x] is for $links[x])
var $source; # Array of source URLs of links ($source[x] is for $links[x])
# Private variables:
var $MAXLINELEN = 2048; # Maximum length of input line.
# Public methods
function harvest($filename) {
$ftype = filetype($filename);
if ($this->_is_schema($filename) || $ftype == 'file') {
$in_url = 0; # state variable: are we currently in a URL?
$in_link = 0; # state variable: are we in linked text?
$first_time = 1;
$getfunc = ($as_text ? "fgetss" : "fgets");
# ...strip HTML, PHP tags if reading as text file.
$prev_line_fragment = '';
while($line = $getfunc($fp, $this->MAXLINELEN)) {
$remaining_line = $prev_line_fragment.$line;
$prev_line_fragment = '';
# ...prepend end of previous line in case there was a read-line
# break in the middle of a tag
if ($as_text && $first_time) { $title = substr($line,0,80); }
while ($remaining_line != '') {
if ($as_text) {
$temp = explode('://', $remaining_line, 2); # literal separator, so no regex split() is needed
if (count($temp) > 1) { # '://' found; now extract schema
# Found; $temp[1] contains rest of line, starting at schema
$in_url = 1;
$in_link = 0; # state variable: are we in linked text? (else, in URL)
$url = '';
$title = '';
$remaining_line = $temp[1]; # start search over with new, truncated line.