Read this primer on using Splunk as a big data analysis tool. You'll learn how to analyze data and filter results, as well as create visual results from that data.
Splunk takes machine data in the form of log files, or other computer-generated text files, and turns it into operational intelligence.
As data goes, machine data (log files) is not particularly exciting compared with, say, human genomic information, but Splunk turns our "digital exhaust" into electronic gold for those who are willing to spend a little time examining it. In fact, machine data is the fastest growing area of big data in enterprises; it includes every user transaction, every system message, every suspicious activity, and every machine-to-machine interaction. And believe it or not, it's exciting to see what's happening on your network. Splunk's powerful Search Processing Language (SPL), originally based on the Unix pipeline and SQL, makes data analysis much more fun and much less complex than you'd think.
The big data part of Splunk comes from the log files, events, and other information that you send to it either directly or via forwarders. The Enterprise version begins its entry-level licensing at 5 GB of data per day. For some enterprises, that is a huge amount of data, and you might never reach the 5 GB limit. In larger enterprises that host system farms, 5 GB might be an hourly average for log files (roughly 120 GB per day), so you'll have to increase your licensing limit to accommodate the volume of data you need to ingest.
How to analyze data and filter results in Splunk
Once you have Splunk Enterprise up and running, it's time to tap into its big data analytics engine. To begin using Splunk's analytics, you must first acquire data by adding so-called data inputs to your Splunk system. You can acquire data by uploading, monitoring, or forwarding. Some data types, such as Windows event logs (.evt or .evtx), do not work with the upload feature; Windows events are best ingested by installing the universal forwarder client.
Though your total amount of collected data can be very large, Splunk's SPL provides data users with a Google-like search experience that includes search term auto-completion and a highly configurable filtering capability. For example, you can enter a single keyword, such as a host's name (adserver1), into the New Search field and view all results that match your keyword. A broad search can yield thousands or hundreds of thousands of results.
To filter these results, enter more specific information into the New Search field, such as host="adserver1" sourcetype="activedirectory". This refined search will narrow returned results into a more manageable dataset that only includes Active Directory-related events.
To further filter the search results, you can enter: host="adserver1" sourcetype="activedirectory" "admonEventType=Start". This query returns a subset of data that only contains Start events from adserver1's Active Directory events.
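The narrowing effect of adding search terms can be sketched outside of Splunk. The Python below is only an illustration, not Splunk's implementation: the event records and the search helper are hypothetical, and every term is treated as a simple field=value match, whereas the last search above actually matches a quoted literal string. The point it shows is that each added term is combined with the others as an implicit AND, so the result set can only shrink.

```python
# Hypothetical events; the field names mirror the searches above.
events = [
    {"host": "adserver1", "sourcetype": "activedirectory", "admonEventType": "Start"},
    {"host": "adserver1", "sourcetype": "activedirectory", "admonEventType": "Update"},
    {"host": "adserver1", "sourcetype": "syslog", "admonEventType": None},
    {"host": "adserver2", "sourcetype": "activedirectory", "admonEventType": "Start"},
]


def search(events, **terms):
    """Return the events that match every field=value term (implicit AND)."""
    return [e for e in events if all(e.get(k) == v for k, v in terms.items())]


# Each call adds one more term, as in the three searches above.
broad = search(events, host="adserver1")
narrower = search(events, host="adserver1", sourcetype="activedirectory")
narrowest = search(
    events,
    host="adserver1",
    sourcetype="activedirectory",
    admonEventType="Start",
)

print(len(broad), len(narrower), len(narrowest))
```

With this sample data, the three searches return three, two, and one event respectively.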
How to create visual results from data in Splunk
Producing lines of events is little better than reading those events from their original locations; however, Splunk's SPL provides the capability to generate meaningful information from those results in the form of statistical data.
Using the example from above, you might want to find out how many Active Directory "Update" events you generate over a selected time period. Any Active Directory object that changes creates an Update admonEventType entry in Splunk. Knowing the number of Active Directory changes occurring over a period of time can clue you in to potential malicious activity after you establish a reasonable change baseline for your environment.
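One way to picture a change baseline is a simple threshold over past daily totals. The Python sketch below is an assumption for illustration only: the daily counts are made up, the is_unusual helper is hypothetical, and the "mean plus three standard deviations" rule is just one common choice, not something Splunk prescribes.

```python
from statistics import mean, pstdev

# Hypothetical daily totals of Update events for one domain.
daily_updates = [110, 95, 102, 99, 105, 98, 101]
today = 240


def is_unusual(history, value, k=3.0):
    """Flag a value more than k standard deviations above the historical mean."""
    return value > mean(history) + k * pstdev(history)


print(is_unusual(daily_updates, today))
```

A day that stays close to the historical average is not flagged, while a sudden spike such as the 240 changes in this sample is.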
To view statistical data, you have to pipe your query results to the stats command:
sourcetype="activedirectory" "admonEventType=Update" | stats count by host
These results can also tell you if your Active Directory servers are all replicating changes across your domain. The Update admonEventType should increment by one on every Active Directory server for each update made in the Active Directory environment.
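To make the grouping and the replication check concrete, here is a small Python sketch of what counting Update events by host amounts to. It is an illustration under assumptions, not Splunk code: the event list is invented, and count_by_host and replication_in_sync are hypothetical helper names.

```python
from collections import Counter

# Hypothetical Update events, one dict per event.
update_events = [
    {"host": "adserver1"},
    {"host": "adserver2"},
    {"host": "adserver1"},
    {"host": "adserver2"},
    {"host": "adserver1"},
]


def count_by_host(events):
    """Group events by host and count them, one total per host."""
    return Counter(e["host"] for e in events)


def replication_in_sync(counts):
    """Return True when every host reports the same number of Update events."""
    return len(set(counts.values())) <= 1


counts = count_by_host(update_events)
print(dict(counts))
print(replication_in_sync(counts))
```

In this sample, adserver1 reports three updates and adserver2 only two, so the check returns False, which is the kind of mismatch that would prompt a closer look at replication.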
If you're a more visual person, rather than a raw data or numbers type, you can use the Visualization tab to display your data in different formats using Splunk's Visualizations. You can select from line graphs, scatter diagrams, pie charts, bar graphs, and other standard data visualization options (Figure A).
You can sort your results to make a chart more striking. For example, if you wish to display a column chart that emphasizes a data point at the beginning or at the end of the visual range, you can apply the sort command in descending or ascending order. A minus sign before the field name sorts in descending order. The following example sorts by count in descending order, placing the largest values at the top of the list, at the start of a column chart, or at the top of a bar graph (Figure B).
sourcetype="activedirectory" "admonEventType=Update" | stats count by host | sort - count
You can further beautify a result set by piping the results to the rename command, which replaces the generic axis labels.
sourcetype="activedirectory" "admonEventType=Update" | stats count by host | sort - count | rename count as "AD Changes"
This changes the generic count label to AD Changes, and this label carries over to any visual representation of the data such as a column chart (Figure C).
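The sort and rename steps can also be read as two small transformations applied to the table of per-host counts. The Python below is a sketch under assumptions, not Splunk's implementation: the rows are invented, and sort_desc and rename are hypothetical helpers that mimic the two steps.

```python
# Hypothetical per-host counts, shaped like rows of a statistics table.
rows = [
    {"host": "adserver2", "count": 17},
    {"host": "adserver1", "count": 42},
    {"host": "adserver3", "count": 8},
]


def sort_desc(rows, field):
    """Return the rows ordered by the given field, largest value first."""
    return sorted(rows, key=lambda r: r[field], reverse=True)


def rename(rows, old, new):
    """Return copies of the rows with the column `old` relabeled as `new`."""
    return [{(new if k == old else k): v for k, v in r.items()} for r in rows]


result = rename(sort_desc(rows, "count"), "count", "AD Changes")
for row in result:
    print(row)
```

Only the label changes in the second step; the values and their order are untouched, which is why the new label can carry over to a chart without affecting the data.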
How to learn more about Splunk
This brief introduction to Splunk as a big data analysis tool only scratches the surface of Splunk's analytic capabilities, but hopefully it's a good start for anyone who feels overwhelmed by learning a new data query language. Splunk's documentation is very thorough, and you can find free training on Splunk's website and extensive Splunk-created content on YouTube.