Apache log analysis with Mathematica

Back in January of 2009, I wrote an article for the Wolfram Research blog describing how we use Mathematica internally for web analytics. Since then, several people have contacted me asking how they might use Mathematica for their own websites. Due to the scale of Wolfram Research’s web operations, the system I described requires a substantial server-side backend. Here, I’m going to walk through some examples of how Mathematica’s built-in features can be used to analyze a smaller volume of web server access logs.

For a sample data set, you can use the logs provided by Tim Bray for his Wide Finder Project. I suggest importing them as text, so that you can try out the various import elements for the "ApacheLog" format without having to re-download (or even reload a local file) each time.

$apachelogs = Import["http://www.tbray.org/tmp/o10k.ap", "Text"];

You’ll need a few utility functions for parsing URLs and the timestamp strings used in the Common Log Format. While you could write a string pattern or regular expression for URL parsing, it’s easier to just use J/Link and the Java standard library.

Needs["JLink`"]

parseURL[url_String] :=
    JavaBlock@Module[
        {javaURL}, (* lowercase-initial local name, to stay clear of built-in symbols *)
        Quiet@Check[
            javaURL = JavaNew["java.net.URL", url];
            {
                "Protocol" -> javaURL@getProtocol[],
                "Host" -> javaURL@getHost[],
                "Path" -> javaURL@getPath[],
                "Query" -> javaURL@getQuery[]
            } /. (Null -> Missing[]), (* a Java null becomes Missing[] *)
            $Failed (* if Java threw any exceptions *)
        ]
    ]

The timestamp format is just complicated enough that DateList needs a little help.

parseDatetime[datetime_String] :=
    DateList[{StringDrop[datetime, -6], (* drop the timezone *)
        {"Day","/","MonthNameShort","/","Year",":",
        "Hour",":","Minute",":","Second"}}
    ]
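To make the two steps inside parseDatetime concrete, here is a minimal sketch that runs them separately on a single timestamp. The value of sampleStamp is made up for illustration and is only assumed to match what the "Date" element returns for these logs.

```mathematica
(* sampleStamp is an invented timestamp in Common Log Format style *)
sampleStamp = "01/Oct/2006:06:33:45 -0700";

(* step 1: StringDrop removes the last six characters, here " -0700" *)
withoutZone = StringDrop[sampleStamp, -6]

(* step 2: DateList reads the remainder using the explicit element list *)
parsedStamp = DateList[{withoutZone,
    {"Day","/","MonthNameShort","/","Year",":",
    "Hour",":","Minute",":","Second"}}]
```

The second result should be a list of the form {year, month, day, hour, minute, second}. Note that dropping the offset means every timestamp is treated as local to the server that wrote the log.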

Evaluating ImportString[$apachelogs, {"ApacheLog", "Elements"}] shows that the following elements are available:

  • ByteCount
  • Date
  • Referrer
  • RemoteHost
  • RemoteUser
  • RequestLine
  • StatusCode
  • UserAgent

A quick review of the Wikipedia article about HTTP should make it clear what each of these elements means.
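As a rough guide to where each element sits in a log line, the sketch below splits one invented line in Apache's combined log format with a regular expression. The line, the names sampleLine, fieldNames, and fields, and the pairing of capture groups with element names are all my own assumptions for illustration. This is not how the "ApacheLog" importer is implemented.

```mathematica
(* sampleLine is an invented request in the combined log format *)
sampleLine = StringJoin[
    "192.0.2.10 - - [01/Oct/2006:06:33:45 -0700] ",
    "\"GET /ongoing/ HTTP/1.1\" 200 4521 ",
    "\"http://www.google.com/search?q=xml\" \"Mozilla/5.0\""
];

(* element names in the order the fields appear on the line *)
fieldNames = {"RemoteHost", "RemoteUser", "Date", "RequestLine",
    "StatusCode", "ByteCount", "Referrer", "UserAgent"};

(* one capture group per element; the second field (identd) is skipped *)
fields = First@StringCases[sampleLine,
    RegularExpression[
        "^(\\S+) \\S+ (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$"
    ] -> {"$1", "$2", "$3", "$4", "$5", "$6", "$7", "$8"}];

Thread[fieldNames -> fields]
```

A hyphen in a field, as in the RemoteUser position here, is the log format's way of saying that no value was recorded.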

The most basic question someone can ask about web traffic is “how much of it was there?” The usual metric is page views: the number of times that individual pages of content were requested from the web server. This isn’t as simple as just counting requests in the log file, because requests for images, JavaScript files, and stylesheets will all be included. These resources are all requested alongside the main content pages and will need to be filtered.

You should construct a list of filename extensions which should not be considered pages, and remove those requests from the logs.

$excluded = Alternatives@@{"atom","rss","png","css","js","jpg","xml","ico","gif"};

$apachelogs = StringJoin@Riffle[
    Select[
        ImportString[$apachelogs, {"Text", "Lines"}],
        StringFreeQ[#, "GET "~~__~~$excluded~~" HTTP"]&
    ],
    "\n"
];
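To see what the filter keeps and drops, here is a small sketch on four made-up request lines. The names excludedDemo, demoLines, keptOriginal, and keptStrict are hypothetical, and the stricter variant that requires a literal "." before the extension is an assumption of mine rather than part of the filter above.

```mathematica
(* same extension list as above, under a separate name so nothing is overwritten *)
excludedDemo = Alternatives @@ {"atom","rss","png","css","js","jpg","xml","ico","gif"};

(* invented request lines, trimmed to the part the pattern looks at *)
demoLines = {
    "\"GET /ongoing/When/200x/2006/09/29/Dynamic-IDE HTTP/1.1\" 200",
    "\"GET /ongoing/ongoing.atom HTTP/1.1\" 304",
    "\"GET /ongoing/serif.css HTTP/1.1\" 200",
    "\"GET /talks/nodejs HTTP/1.1\" 200"
};

(* the filter from the text: drop requests whose path ends in a listed suffix *)
keptOriginal = Select[demoLines,
    StringFreeQ[#, "GET " ~~ __ ~~ excludedDemo ~~ " HTTP"] &];

(* assumed stricter variant: the suffix must follow a literal "." *)
keptStrict = Select[demoLines,
    StringFreeQ[#, "GET " ~~ __ ~~ "." ~~ excludedDemo ~~ " HTTP"] &];

Length /@ {keptOriginal, keptStrict}
(* {1, 2} *)
```

The original pattern drops the invented /talks/nodejs page because its path happens to end in "js". Whether that matters depends on how the site names its pages.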

Once the logs are filtered down to valid page requests, parse the date strings and tally up the requests by minute.

$pageviews = Tally@Map[
    parseDatetime[#][[1;;5]]&,
    ImportString[$apachelogs, {"ApacheLog","Date"}]
];
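The shape of the tallied data is easier to see on a toy input. This sketch uses three invented date lists (demoDates is a hypothetical name) in place of the parsed log timestamps.

```mathematica
(* made-up date lists of the kind parseDatetime returns *)
demoDates = {
    {2006, 10, 1, 6, 33, 45.},
    {2006, 10, 1, 6, 33, 50.},
    {2006, 10, 1, 6, 34, 2.}
};

(* keep year through minute, then count identical prefixes *)
demoPageviews = Tally[demoDates[[All, 1 ;; 5]]]
(* {{{2006, 10, 1, 6, 33}, 2}, {{2006, 10, 1, 6, 34}, 1}} *)
```

Each entry is a {date list, count} pair, which is a form that DateListPlot accepts directly.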

Finally, plot the results.[1]

[Figure: page views to tbray.org]

The profile of a site’s traffic sources can be very useful for understanding its audience. Does traffic come from search engines, from direct navigation, or from social news sites? By ordering the site’s referring domains by page views, you can see at a high level how your audience is finding your content.

Some high-profile sites, especially Google, use per-country domain names. When trying to distinguish search traffic from social news site traffic, those geographical distinctions aren’t useful. The function that extracts the domain from the referrer is a good place to apply labels grouping all the search engine variants.

referrerDomain[referringURL_String] :=
    Module[
        {parsedURL},
        parsedURL = parseURL[referringURL];
        If[ parsedURL === $Failed,
            "No referrer",
            StringReplace["Host" /. parsedURL,
                {
                    ___~~".google."~~___ -> "Google Search",
                    ___~~".msn."~~___ -> "Microsoft Search",
                    ___~~".yahoo."~~___ -> "Yahoo! Search"
                }
            ]
        ]
    ]
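The labeling step can be tried without Java by applying the same rules to host names directly. In this sketch, searchLabels and demoHosts are hypothetical names and the hosts are made up.

```mathematica
(* the labeling rules from referrerDomain, applied to hosts directly *)
searchLabels = {
    ___ ~~ ".google." ~~ ___ -> "Google Search",
    ___ ~~ ".msn." ~~ ___ -> "Microsoft Search",
    ___ ~~ ".yahoo." ~~ ___ -> "Yahoo! Search"
};

(* invented host names, including two country variants of the same engine *)
demoHosts = {"www.google.co.uk", "www.google.com", "search.msn.com",
    "lambda-the-ultimate.org", "google.com"};

labeledHosts = StringReplace[demoHosts, searchLabels]
(* {"Google Search", "Google Search", "Microsoft Search",
    "lambda-the-ultimate.org", "google.com"} *)
```

Because each rule requires a dot on both sides of the name, a bare host such as "google.com" passes through unlabeled, while hosts that match no rule are left as they are.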

Import the "Referrer" element, tally up the domains, and take the top 10.

    (Reverse@SortBy[
        DeleteCases[
            Tally@Map[
                referrerDomain,
                ImportString[$apachelogs, {"ApacheLog", "Referrer"}]
            ],
            {"www.tbray.org", _} (* ignore within-site referrers *)
        ],
        Last
    ])[[1;;10]]
Referrer                   Pageviews
No referrer                1675
Google Search              114
lambda-the-ultimate.org    34
www.bloglines.com          7
blogs.sun.com              7
www.stumbleupon.com        5
Microsoft Search           5
www.angelfire.com          4
bloglines.com              4
Yahoo! Search              3

There appears to be an unusual amount of direct traffic to Tim Bray’s site. There’s probably an explanation, but I’ll leave finding it as an exercise for the reader.


  1. The plot was generated using DateListPlot.

    DateListPlot[$pageviews, Joined -> True,
        ImageSize -> 440, AspectRatio -> 1/10,
        PlotRange -> All, Frame -> False,
        AxesOrigin -> {$pageviews[[1,1]], 0},
        BaseStyle -> {FontFamily -> "Arial", RGBColor[.2,.2,.2]}
    ]