I would like to share some information about web archiving that may be useful before purchasing a solution for compliance, preservation, BI, etc.
Let me know what you think!
Web Archiving Buyer's Guide:
Introduction: Web archiving is fairly new compared to other data retention practices that have existed for a long time.
That makes it even harder for a consumer to gather the necessary information and make a smart purchasing decision. This guide contains basic information about web archiving, lists some of the most popular providers on the market, and sets out some criteria and choices.

Different web archiving approaches
Web archiving has made significant progress during the last five to seven years.
It now offers a choice of approach to both policy and supporting technology.
These choices should be considered carefully against business objectives before the decision is made. The main differences lie in the capture and access methods used.
Three different methods exist to capture and archive web content:
a. client-side archiving
b. transaction archiving
c. server-side archiving
Client-side archiving uses an archival crawler, derived from search engine crawler technologies, with significant enhancements to ensure that complex and hard-to-reach content can be found and captured, as well as stored without change. Starting from seed pages or entry points, these tools automatically capture pages and parse them to extract all links. The process repeats and continues as long as newly discovered pages remain within the scope defined for the crawl. The captured web content and embedded files are stored unchanged (original and authentic copies, an exact equivalent of what a generic user would have received in their browser at the time) and preserved in a flat, standards-based and self-contained file format that can be confidently considered future proof.
This is especially important within a legal context.
To be effective this method requires a crawler with excellent link extraction and path-finding algorithms that can work in a wide range of circumstances and site/page designs. In addition to client-side archiving, there are two alternative methods to capture web content. Both methods need to be operated from the server-side; require prior authorisation to services; and need access to both front-end and back-end servers.
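The crawl loop described above (start from seed pages, capture, extract links, stay within scope) can be sketched in a few lines of Python. This is a minimal illustration, not a production crawler: the scope rule, the injected `fetch` function, and the in-memory archive are all simplifying assumptions, and a real archival crawler would add politeness delays, robots.txt handling, and WARC output.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects href/src attributes, resolved against the page URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(urljoin(self.base_url, value))

def in_scope(url, seed):
    """Hypothetical scope rule: same host as the seed page."""
    return urlparse(url).netloc == urlparse(seed).netloc

def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from `seed`; `fetch(url)` returns page HTML.
    Returns {url: html} -- the content captured and stored unchanged."""
    frontier, seen, archive = [seed], {seed}, {}
    while frontier and len(archive) < max_pages:
        url = frontier.pop(0)
        html = fetch(url)
        archive[url] = html                 # store the page unchanged
        parser = LinkExtractor(url)
        parser.feed(html)                   # parse to extract all links
        for link in parser.links:
            if link not in seen and in_scope(link, seed):
                seen.add(link)
                frontier.append(link)
    return archive
```

`fetch` is injected here so the loop can be exercised without network access; a real crawler would use an HTTP client and write each response to an archive file instead of a dict.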
The first of these alternative methods, called transaction archiving, consists of the systematic capture and archiving of all browser/server exchanges (request/response pairs) resulting from the interaction of users with sites, regardless of their content type and how they are produced. Transaction archiving enables tracking and recording of every actual instantiation of content in an authentic flat HTML form, easy to maintain and preserve over time.
Moreover, it can be used to archive hidden web content, provided this content is requested, i.e. read, by the websites' users during the capture period.
However, transaction archiving generates unnecessary duplicates of frequently visited pages and raises serious privacy concerns, as the method implicitly relies on usage tracking.
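One way to picture transaction archiving is as a layer sitting between browser and server that records every request/response pair it sees. The WSGI middleware below is a hypothetical minimal sketch (no deduplication, no persistence); it also makes the two drawbacks above concrete: the same page requested twice is archived twice, and every user interaction is logged.

```python
class TransactionArchiver:
    """WSGI middleware: records each request/response pair it sees."""
    def __init__(self, app):
        self.app = app
        self.log = []   # the archive: (path, status, body) per transaction

    def __call__(self, environ, start_response):
        captured = {}
        def recording_start_response(status, headers):
            captured["status"] = status
            return start_response(status, headers)
        body = b"".join(self.app(environ, recording_start_response))
        # Every user interaction is recorded -- hence the duplication
        # and privacy issues noted above.
        self.log.append((environ.get("PATH_INFO", "/"),
                         captured["status"], body))
        return [body]
```

Wrapping any WSGI application with `TransactionArchiver(app)` makes its `log` grow by one entry per request served.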
The second, and more obvious, alternative to client-side archiving is server-side archiving. This consists of directly copying files from the document folders to back-up servers. Although it might appear to be the simplest approach, it is in fact seriously flawed, from both the preservation and archive access points of view.
To make certain that any web content archived using this method can be properly restored, server-side archiving requires that all original CMSs, databases and other software are archived alongside the content or are actively maintained in an operational state; or that the content is migrated to newer CMSs, databases, etc. In any case, these activities will be required for the whole period of archive retention. Interestingly, IT backups essentially rely on this method in almost all cases, and so systematically fail to provide the long-term preservation and access capabilities that are essential for legal and compliance requirements.
However, for some types of hidden-web content, this method can prove to be
useful, mainly in situations where it is required to archive parts of websites that a
client-side crawler cannot reach.
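Server-side archiving as described above is essentially a timestamped copy of the document folder. The sketch below (paths and function name are hypothetical) shows both the method and its flaw: what gets copied are templates and files, while the CMS, database and runtime needed to render them are not captured.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path

def server_side_snapshot(docroot, backup_root):
    """Copy the server's document folder to a timestamped backup directory.

    Note what is NOT captured: the CMS code, the database, the runtime --
    exactly the gap that makes this approach flawed for long-term access,
    unless that software is archived or maintained alongside the content.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest = Path(backup_root) / f"snapshot-{stamp}"
    shutil.copytree(docroot, dest)
    return dest
```

A copied PHP template, for example, is still source code: restoring it years later requires the original (or a migrated) CMS and database to produce the pages users actually saw.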
The market

Although most, if not all, of the solutions available on the market today use a client-side archiving approach, we can split them into two categories.
The first category we're going to call website copiers, due to certain similarities with HTTrack; this technology consists mainly in taking snapshots of websites and archiving them.
Pros:
- low-cost solution
- small disk usage

Cons:
- dynamic content is not played back
- does not replay the archives
- low level of depth
Examples of companies on the market: Iterasi, Nextpoint. These solutions are suitable for litigation support and compliance (but not dynamic media).
The second category is content archiving. This web archiving method allows the capture of rich and highly dynamic content.
It uses web bots (i.e. crawlers) that capture all web pages (including social media). The web pages are stored exactly as they are captured (including links, rich media, video, and Flash).
Pros:
- a technology that captures multiple web formats on dynamic websites
- high-quality archive accessibility and rendering
- full-text search for large web archive collections
- deduplicated full-text search results in real time
- daily archiving capabilities
- support of the WARC ISO file format
- in-house solutions
Cons:
- these solutions cost more than the ones in the first category
- they consume resources (disk space, CPU, etc.)
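The WARC format (ISO 28500) mentioned in the pros list is, at its core, just a sequence of records, each consisting of a version line, named headers, and a payload. The simplified writer below shows the shape of a single response record; it is an illustration of the format only, and omits pieces a real tool would add, such as WARC-Record-ID, payload digests, and gzip compression.

```python
from datetime import datetime, timezone

def warc_response_record(url, http_payload):
    """Build one minimal WARC/1.0 'response' record as bytes.

    Simplified for illustration: real WARC writers also emit
    WARC-Record-ID and digest headers."""
    body = http_payload.encode() if isinstance(http_payload, str) else http_payload
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {url}",
        "WARC-Date: " + datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "Content-Type: application/http;msgtype=response",
        f"Content-Length: {len(body)}",
    ]
    # header block, blank line, payload, then the record separator
    return "\r\n".join(headers).encode() + b"\r\n\r\n" + body + b"\r\n\r\n"
```

Because the format is flat and self-describing, records written this way can be concatenated into one `.warc` file and still be read decades later without the software that produced them, which is what makes it attractive for long-term preservation.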
Some companies are pushing web archiving further than a compliance solution, such as Aleph Archives which implemented business intelligence tools to take advantage of the tremendous amount of data gathered on the internet.
Some companies claim to have an in-house solution, but most of them only store the data in-house.
Examples of companies offering this solution: Aleph Archives, Hanzo Archives.
Cloud based: Most of the solutions mentioned above are cloud based; pricing differs from one company to another, since there are only a few competitors in this space.
Usually, and logically, solutions that take only snapshots are more affordable, due to a less complicated technology and small disk-space usage.
Solutions offering a full, in-depth capture of websites are more expensive; they usually charge per URL, and base their price on the archiving frequency, the scope (list of URLs), and the operation fees (maintenance, data security, etc.).
Some companies base their prices on data storage.
In-house solutions: very few companies provide a fully automated in-house solution. The in-house solution's price is hard to determine, and it's usually more expensive than cloud based; however, it can be considered a one-time fee (the customer purchases the licence), plus maintenance and support if the customer chooses a support plan.
Recommendations prior to buying a web archiving solution:
- specify your needs: if you are under a regulation, you can be compliant using any of the solutions mentioned above; however, in making your decision you will need to consider that a web archiving solution can go beyond "the compliance and litigation support need", such as providing relevant data to departments and preserving your corporate heritage. Numerous corporations and enterprises that are not under any regulation choose to acquire a web archiving solution for business intelligence and social media content monitoring, in order to enhance their customer service and avoid false disclaimers.
- acquiring a web archiving solution can be an investment or an annual expense. The in-house solution can be a long-term investment, and allows you more freedom, more security, and no latency. Some companies which offer technologically competitive solutions recommend the in-house deployment.
- ask about the archiving process and judge if it is suitable for your needs.
- compare the capabilities of all solutions.