So you won't allow logfiles to happen again?

VLM · on May 13, 2013

LOL, no, not as a primary store of data, no never again.

Plain text logs are a great place to funnel all "unable to connect to database" type of errors for dbas / sysadmins to ponder, however.

I've implemented quite a few systems where data changes are logged into a log table, all fully rationalized, so various reports can be generated about discrepancies and access rates and stuff like that. This is also cool for endusers to see who made the last change etc.

Trying to reverse engineer how some data got into a messed up condition using logs that can be JOINed to actual data tables as necessary is pretty easy, compared to trying to write a webserver log file parser to read multiple files to figure out who did what, when, to the data resulting in it being screwed up. You parse log files for data just once before you decide to do that stuff relationally. Debug time drops by a factor of 100x.

DenisM · on May 13, 2013

I'd like to pick your brain as that's the problem I'm facing right now - I have a web site that is accessed by users, and I would like to get a comprehensive picture of what they do. I already have a log table for logging all changes (as you said - I can show it to the users themselves so they know who hanged what in a collaborative environment), but I struggle defining meaningful way to log read access - should I record every hit? Would that be too much data to store and process later? Should I record some aggregates instead?

VLM · on May 13, 2013

Rather than aggregates, sampling.

Sounds like you're not interested in absolutely every event for security audits / data integrity audits and more interested in general workflow. Be careful when selecting samples to store because what superficially looks random might not be (I donno, DB primary key, or timestamp ends in :01?). Is (one byte read from /dev/urandom )== 0x42 if so log it this time.

Also it depends on if you're unable to log everything because of sheer volume or lack of usable purpose or ... For example if you've got the bandwidth to log everything but not to analyze it in anything near realtime, maybe log Everything for precisely one random minute per hour. So this hour everything that happens at 42 minutes gets logged, everything at minute 02 next hour, whatever. Better mod60 your random number.

You can also play games with hashes IF you have something unique per transaction and you have a really speedy hash then a great random sampler would be hashing that unique stuff and then only log if the hash ends in 0x1234 or whatever.

If you have multiple frontends, and you REALLY trust your load balancer, then just log everything on only one host.

I've found that storing data is usually faster and easier than processing it. Your mileage may vary.

Another thing I've run into is its really easy to fill a reporting table with indexes making reads really fast, while killing write performance. So make two tables, one index-less that accepts all raw data and one indexed to generate a specific set of reports, then periodically copy some sample outta the index free log table and into the heavily indexed report table.

Its kinda like the mentality of doing backups. Look at how sysadmins spend time optimizing doing a full backup tape dump and sometimes just dump a delta from the last backup.

You're going to have to cooperate with operations to see what level of logging overloads the frontends. There's almost no way to tell other than trying it unless you've got an extensive testing system.

I've also seen logging "sharded" off onto other machines. Lets say you have 10 front ends connecting to 5 back ends, so FE#3 and FE#4 read from BE#2 or whatever. I would not have FE3 and FE4 log/write to the same place they're reading BE2. Have them write to BE3 or something, anything than the one they're reading from. Maybe even a dedicated logging box so writing logs can never, ever, interfere with reading.

Another strategy I've seen which annoys the businessmen is assuming you're peak load limited, shut off logging at peak hour. OR, write a little thermostat cron job or whatever where if some measure of system load or latency exceeds X% then logging shuts off until 60 minutes in the future or something. Presumably you have a test suite/system/load tester that figured out you can survive X latency or X system load or X kernel level IO operations per minute, so if you exceed it, then all your FE flip to non-logging mode until it drops beneath the threshold. This is a better business plan because instead of explaining to "the man" that you don't feel like logging at their prime time, you can provide proven numbers that if they're willing to spend $X they could get enough drive bandwidth or whatever such that it would never go in log limiting mode. Try not to build an oscillator. Dampen it a bit. Like the more percentage you exceed the threshold in the past, the longer logging is silenced in the future, so at least if it does oscillate it won't flap too fast.

One interesting strategy for sampling to see if it'll blow up is sample precisely one hour. Or one frontend machine. And just see what happens before slowly expanding.

Its worth noting that unless you're already running at the limit of modern technology, something that would have killed a 2003 server is probably not a serious issue for a 2013 server. What was once (even recently) cutting edge can now be pretty mundane. What killed a 5400 RPM drive might not make a new SSD blink.

(Whoops edited to add I forgot to mention that you need to confer with decision makers about how many sig figs they need and talk to a scientist/engineer/statistician about how much data to obtain to generate those sig figs... if the decision makers actually need 3 sig figs, then storing 9 sig figs worth of data is a staggering financial waste but claiming 3 sig figs when you really only stored 2 is almost worse. I ran into this problem one expensive time.)

seestheday · on May 13, 2013

It sounds to me like you're trying to reinvent web analytics. Is there a reason you need user level granularity or is aggregate data enough?

DenisM · on May 13, 2013

My web site has separate "accounts" for multiple companies, each has multiple users. I'd like three level of analytics - for a given company (both me and the company agent would like to see this), across all companies for all users (I will see this), and rolled up within each company (i.e. higher-level of activity where it's companies are being tracked, not individual users).

User-level data might be useful for tech support (although this is currently working fine with text-based log files and a grep).

So I guess I am not sure... I might be content with web analytics... Each company has its own URL in the site, like so: http://blah.com/company/1001/ViewData, http://blah.com/company/1002/ViewData, etc. Using e.g. Google Analytics I could see data for one company easily, but can I see data across all companies (how many users look at ViewData regardless of company id)? Can I delegate the owner of company 1001 to see only the part of the analytics?

Another monkey wrench is the native iPad app - ideally the analytics would track users across both native apps and the web site.

geon · on May 14, 2013

> using logs that can be JOINed to actual data tables as necessary is pretty easy

Can you give me a concrete example of how you would use this?

VLM · on May 14, 2013

If by concrete you mean business case example, it comes up a lot in "the last three people to modify this record were..."

Security audit trail type stuff "Why is this salesguy apparently manually by hand downloading the entire master customer list alphabetically?".

This isn't all doom and gloom stuff either... "You're spending three weeks collecting the city/state of every customer so a marketing intern can plot an artsy graph by hand of 10000 customers using photoshop? OMG no, watch what I can do with google maps/earth and about 10 lines of perl in about 15 minutes" Or at least I can run a sql select that saves them about 50000 mouse clicks in about 2 minutes of work. Most "suit" types don't get the concept of a database and see it as a big expensive Excel where any query more complicated than select * is best done by a peon by hand. I've caught people manually alphabetizing database data in Word for example.

Another thing that comes up a lot in automation is treating a device differently WRT monitoring and alerting tools if a change was logged within the last 3 days. So your email alert for a monitored device reads contains a line something like "the last config change made was X hours ago by ...". Most of the time when X=0 or X=1 the alert is because ... screwed up, and when it isn't a short phone call to ... is problem isolation step #1.

This was all normal daily operations business use cases, aside from the usual theoretical data mining type stuff, like a user in A/B marketing test area "B" tends to update data table Q ten times more often than marketing test area "A" or whatever correlation seems reasonable (or not).

geon · on May 15, 2013

That sounds reasonable. I've been thinking about doing logging as a text blob on the affected object, but it haven't seemed useful enough.

Using your approach, I guess it would be a table like

    create table logs (
        tableName varchar,
        oldValue text,
        newValue text,
        userID int,
        when datetime
    );

Or did I miss something?