General discussion

Locked

27.3 MB text file?

By Jaqui

March, 2008: the log alone for TR pulling my AV and some of the smileys posted is 27.3 MB.

And it's only 27.3 MB AFTER I remove the log entries for any visitor not from TR from the log for my site.


6 total posts (Page 1 of 1)  

All Comments


yeah

by Jaqui In reply to 27.3 MB text file?

it works out to 7,764 US Letter-size pages.

It took 35 minutes for OpenOffice Writer to fully open and paginate the file.


Got ya beat...

by Forum Surfer In reply to yeah

I once had someone hand me a file that wouldn't open. It was a packet capture output to text, a full 1 gig. I happened to have a monster of a server with a fresh install of Server 2003 R2. Believe it or not, the beast opened the file without a hiccup using Notepad of all things... nothing fancy. Granted, the server was meant for a GIS application, so I bought it with 4 AMD Opteron 8220s, 16 gigs of memory, and a RAID array of 300 gig 15,000 RPM SAS drives... a real monster of a server. I almost told the guy there's no way ANYTHING can open a 1 gig text file; even if it's a 64-bit editor, 1 gig is too much. It shocked me to see the file open and actually let me search it using Notepad in a matter of minutes, if that!


oh, opening it in a text editor

by Jaqui In reply to Got ya beat...

is only 30 seconds; it was making the page breaks that took OOo Writer so long.

I'm just amazed that TR pulling my AV and a few emoticons from my site adds up to 27+ megs of server log in one month.

The files pulled total about 1.4 MB, so a couple of gigs of data transfer were used for it.


You trying to impress us

by Dr Dij In reply to yeah

with how big your logfile is? :)

If you want to analyze log files, it gets complicated.

Editors like Word will barf if the file is over 20 to 40 megs.
Be happy you're on Unix; Word can go into endless loops when I try to open huge files and number them to look at records.

I gave up using Word and wrote a small program (see the sketch after this list).
flow:
1) parse out the fields -
IP - address of the bot, search engine, person browsing, or proxy
referer or '-' (what website links to yours)
return status (200 = ok, 404 = missing, 403 = blocked, etc.)
total bytes returned (more for jpegs, less for htm usually)
date + time (handy to see if robots hit a speed trap)
URL they are grabbing
UA (what type of computer and software - useful for trapping bots)

You probably know the fields from looking at the log file.

2) identify bots based on the UA (e.g. Yahoo has 'slurp' in the UA), assign a flag bot=true and a botname (determined from a lookup table, or just hard-coded); also flag any IP addresses YOU use and set the bot name to 'me' or 'me2' so they aren't confused with actual users, if you heavily browse your own pages. I do, as I create talks from the content.
3) trap any '.com' or '.edu' in the UA where botflag=false. This will allow you to add new bots & rerun.
4) skip any record for files you DON'T want to flag:
e.g. counter hits, small jpegs on every page, etc.
5) write to a database temp file, with multiple sorts and additional summations; some of the sorts and summations you only write if certain conditions are true or false (see the sorts below), and some you add into an existing record, reading it first and adding to the byte / file-access counts.
6) when all recs are processed, read the database temp file end to end, all the different rec types, and output a tab-delimited file named '.xls' instead of '.tab'. That way you just double-click it and it opens in Excel (or the Linux equivalent). You could output to different files. I insert a blank line when sort keys change.
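
For illustration, here's a rough Python sketch of steps 1 through 3 - not the actual program, and it assumes the common Apache 'combined' log format; the bot lookup table and the 'me' address below are just placeholders:

    import re

    # Apache "combined" log format (assumed): IP, identd, user, [date],
    # "request", status, bytes, "referer", "user agent".
    LINE_RE = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<url>\S+)[^"]*" '
        r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
        r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"')

    # Step 2: UA fragments -> bot names (illustrative lookup table, not exhaustive).
    BOT_UAS = {'slurp': 'yahoo', 'googlebot': 'google', 'msnbot': 'msn'}
    MY_IPS = {'192.0.2.10': 'me'}   # your own addresses, so you don't count yourself

    def parse(line):
        m = LINE_RE.match(line)
        if not m:
            return None                       # malformed line, skip it
        rec = m.groupdict()
        rec['bytes'] = 0 if rec['bytes'] == '-' else int(rec['bytes'])
        ua = rec['ua'].lower()
        rec['botname'] = next((name for frag, name in BOT_UAS.items() if frag in ua), '')
        if rec['ip'] in MY_IPS:
            rec['botname'] = MY_IPS[rec['ip']]
        rec['bot'] = bool(rec['botname'])
        # Step 3: unflagged UAs that mention a .com or .edu are probably new bots.
        rec['suspect'] = (not rec['bot']) and ('.com' in ua or '.edu' in ua)
        return rec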

You can have any sorts and summations you like in the db temp file.
I currently have these output sorts (each output sort shows all the input fields plus any count or byte sums):

imaro + IP address - I assign this text sort key if they hit my robot trap, an HTML cross-reference (link) that is not visible but that a robot will blindly follow. Anything that hits your robot trap should have a spider name assigned; if it's blank, you can see the IP address of the unknown robot, and if there's nothing in the UA on that line, you'll have to make up a name for it.

IP + totbytes - a nice sum with a counter of # files hit and total bytes. Nice for finding page-scraper bots, or 'Hueys' (Heavy Users).
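
A sketch of that summation, continuing from the parsing sketch above (the file names here are made up); it writes the tab-delimited .xls-named output as in step 6:

    from collections import defaultdict

    totals = defaultdict(lambda: [0, 0])          # ip -> [hits, bytes]

    with open('access.log') as log:               # log file name is an assumption
        for line in log:
            rec = parse(line)                     # parse() from the sketch above
            if rec is None:
                continue
            totals[rec['ip']][0] += 1
            totals[rec['ip']][1] += rec['bytes']

    # Tab-delimited output named .xls so a double-click opens it in a spreadsheet.
    with open('ip_totbytes.xls', 'w') as out:
        out.write('ip\thits\ttotbytes\n')
        for ip, (hits, nbytes) in sorted(totals.items(), key=lambda kv: -kv[1][1]):
            out.write(f'{ip}\t{hits}\t{nbytes}\n')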

IP + URL (NOBOTS BAD) - write this sort only if botflag=false and they have a 403 or 404 status or hit your 'missing page' 404.shtml page

IP + URL (NOBOTS) - shows what pages users hit, by user IP address; good for spotting heavy users and scrapers. I found one scraper had grabbed 515 megs before I 403'd him. I noticed because every page on one site was being scraped sequentially, for quite a few pages.

URL (NOBOTS) - I write this sort keyed by URL only, to identify popular pages. On output to .xls I delete any with just '1' hit. You can then sort by descending hits when you open it in the spreadsheet.

REFER TO URL - this sums by referring URL. You can also do 'REFER URL + URL' to show which external pages link to which of your pages.

BOTNAME + URL - here everything is lumped together by botname, since bots often hit you from multiple IPs; all of them sort together here.

You can also sum up BOTNAME + IP to show which IPs you're getting hit from, in case you want to ban a bot.

IP + UA or BOTNAME + UA if you want to see if an IP has multiple UAs (indicates a proxy or a bot changing UAs)

You can also analyze by time of day for one IP. I had an IP repeatedly hitting the site that I was trying to figure out whether it was a bot; this showed what hour of day they kept hitting the site.
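
Something like this (again just a sketch, reusing the parsed records from above) gives the hour-of-day profile for one suspect IP:

    from collections import Counter

    def hourly_profile(records, suspect_ip):
        # Apache timestamps look like "10/Mar/2008:14:55:36 -0800";
        # the piece after the first ':' is the hour of day.
        hours = Counter()
        for rec in records:
            if rec['ip'] == suspect_ip:
                hours[rec['when'].split(':')[1]] += 1
        return dict(sorted(hours.items()))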

You can also ID scrapers by looking for sequential file access; since people typically DON'T access every page on your site, a bot will show up this way.
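
One crude way to check for that (purely a sketch; the "trailing number in the file name" heuristic is mine, not a standard):

    import re

    NUM_RE = re.compile(r'(\d+)\D*$')   # trailing number in a URL, e.g. /talks/page12.htm

    def looks_sequential(urls, min_run=5):
        # An IP whose consecutive requests walk through numerically increasing
        # file names is probably a scraper, not a person clicking around.
        run = best = 1
        prev = None
        for url in urls:
            m = NUM_RE.search(url)
            n = int(m.group(1)) if m else None
            run = run + 1 if (prev is not None and n == prev + 1) else 1
            best = max(best, run)
            prev = n
        return best >= min_run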

And you probably know how to look up the whois info.

I scan for hack attempts (they can't get in, but it's worth banning someone trying to subvert .php functions; these typically show up as long URLs with cgi or php stuff under the 404 pages-not-found entries).
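
A quick filter for those, in the same sketch style (the pattern list is just an example; tune it to what you actually see in your logs):

    SUSPICIOUS = ('cgi-bin', '.php?', 'cmd=', '../', 'select+')   # illustrative patterns

    def probable_probes(records):
        # Yield (ip, url) for 404s whose URLs look like probes for exploitable scripts.
        for rec in records:
            url = rec['url'].lower()
            if rec['status'] == '404' and any(p in url for p in SUSPICIOUS):
                yield rec['ip'], rec['url']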

If you want to get heavily into log stuff, check the sites on webmasterworld. incredibill has his own blog on blogspot too, with some good ideas.

Also, save your monthly raw logs; combine them and you'll get a better picture of access.

It sure is nice to see access WITHOUT bots, something that webalizer and analog DON'T do.

You can set up the log-analysis program to take an IP address (or partial address) or a botname as input at the start. If it's blank, run the whole file; if not, search for that botname in the assigned spider name or in the UA, or search for the input IP address in the parsed-out address, and skip the record if it's not present when running for just one IP or bot.

When you run for just one IP or bot, append that run string to the output file name; then you'll have a separate spreadsheet created that covers just that one.
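
As a sketch, the filter plus the file-name tagging could look like this (the argument handling is just one way to do it):

    import sys

    def keep(rec, wanted):
        # Empty filter keeps everything; otherwise match on an IP fragment,
        # the assigned botname, or a substring of the UA.
        if not wanted:
            return True
        return (wanted in rec['ip'] or wanted == rec['botname']
                or wanted.lower() in rec['ua'].lower())

    wanted = sys.argv[1] if len(sys.argv) > 1 else ''
    suffix = f'_{wanted}' if wanted else ''
    outname = f'ip_totbytes{suffix}.xls'      # e.g. ip_totbytes_slurp.xls for one bot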

Have fun!


have you

by Jaqui In reply to You trying to impress us

ever looked at AWStats?

it breaks down the logs with that info, plus search engine terms and words,
whether or not the bots use robots.txt,
visitor country of origin,
OS type of visitor,
and browser used.

It will list every file accessed, and how much data transfer was used,

all in one nice HTML-formatted page in the browser :)


the prob with awstats

by Dr Dij In reply to have you

and Analog and Webalizer
is that they can't identify robots very well.
They get the large ones like Google,
but some bots don't identify themselves in the UA, so you need to ID them yourself.

Plus they don't filter out stuff I don't want to see (e.g. endless one-page hits; I have 44 gigs online, and the bots are trolling it heavily).

The country of origin is nice. So are the hit terms used (what search engine terms referred them to my pages). I COULD parse those out, but every search engine sends them differently in the referer string, so there are some things they do nicely.

Also, due to my site size, I actually cannot browse some of the results HTML pages; they are too big and the browser hangs and finally crashes, sometimes taking my computer down with it. This is due to memory limitations or something else in both IE and Firefox.
