During the early days of the internet, fluorescent text and pixelated GIFs could be seen everywhere. But over time, the zany, eye-catching colour schemes and heavy use of WordArt were replaced with more design-conscious iterations of the web.

As the world’s history and culture increasingly shifted online, and old web pages were continually replaced by newer ones, the National Library of Australia (NLA) faced a challenge regarding how it would fulfill its role of documenting Australia’s history and culture.

Rather than lose online information that has national significance to Australia’s history and culture, the NLA built an archive of online content to show how Australian websites have evolved over time.

The archive, called the Australian Web Archive (AWA), sheds light on the world of the late ’90s and ’00s by providing a snapshot of the internet during its infancy. It holds Australian websites ending in “.au” from 1996 onwards, content that NLA’s curators have deemed culturally significant, and online Australian government content.

Keeping this content in an online archive gives the NLA the capacity to record information that resides online, which is increasingly where Australia’s history and culture are located and created. NLA’s chief information officer, David Wong, told TechRepublic that it was important for the organisation to have an archive that could capture online information as it continued to migrate physical forms of information, such as manuscripts and journals, online.

“A lot of physical information has also moved to digital. A lot of historians today have migrated to using the digital collections instead of manuscripts and journals,” Wong said.

The AWA contains 600 terabytes of data across 9 billion records; it is a combination of records from the PANDORA Archived websites, the Australian Government Web Archive, and websites relating to Australia collected annually through large-scale crawl harvests.

How the AWA sorts through online junk and fake news

The difficulty in creating such an archive, however, was ensuring that only the most relevant information was collected, Wong said.

“There’s just so much great content and a challenge for us is to decide on what data to collect, how to collect data, and how to make sense of that information, so that people who come to our archives can actually find the content they are looking for,” Wong explained.

To achieve this, Wong and the team behind AWA put a lot of thought into creating a system that could distinguish between what was culturally significant to Australia and what was “junk”.

Various classification technologies were used to build the archive, including a modified version of Google’s 1998 PageRank algorithm, a Bayesian filter, and a Yahoo NSFW classifier. According to Wong, the NLA chose Google’s 1998 PageRank algorithm, which ranks a page by the number and importance of the pages linking to it, as it is “often a really good indicator of quality of the content”. The use of Yahoo’s NSFW classifier was also important, Wong added, as a lot of website traffic is driven by pornographic content and the classifier can identify images that are inappropriate for the archive. Bayesian filters, commonly used for filtering spam email, are also used by the archive.
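
For readers curious about how a link-based ranking works, here is a minimal sketch of the textbook PageRank iteration. It is not the NLA’s modified version, whose details the article does not give. The page names, the damping factor of 0.85, and the iteration count are all illustrative assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Returns a dict of page -> score; scores sum to 1.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with rank spread evenly across all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank evenly to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank to all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: three pages link to news.example.au.
example_links = {
    "blog.example.au": ["news.example.au"],
    "shop.example.au": ["news.example.au"],
    "forum.example.au": ["news.example.au"],
    "news.example.au": ["blog.example.au"],
}
scores = pagerank(example_links)
top_page = max(scores, key=scores.get)
```

In this toy graph the page with the most inbound links ends up with the highest score, which is the intuition behind using link structure as a rough proxy for quality.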

At a time when “fake news” has run rampant across the internet, thereby increasing societal concern regarding trust and the doctoring of information, Wong also acknowledged the importance of prioritising authenticity when it came to creating the archive.

To safeguard the AWA’s data against alteration, the content collected is stored in read-only format so that it is very difficult to modify down the track. The NLA also keeps multiple backups, including three copies of each piece of content, Wong said.

As an archive dedicated to preserving “Australia’s memory”, the AWA also takes regular snapshots of content that has been updated over time. Taking snapshots at different points in time allows users of the archive not only to see how content has changed, but also to determine whether any information has been modified or doctored.
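
The article does not say how the AWA compares snapshots, but one common way to spot a change between dated captures is to compare content hashes. The sketch below assumes that approach. The function names and the sample snapshot data are hypothetical.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 digest that changes if the content changes."""
    return hashlib.sha256(content).hexdigest()


def changed_snapshots(snapshots):
    """Return the labels of snapshots that differ from the one before.

    snapshots is a chronologically ordered list of (label, content) pairs.
    """
    changes = []
    previous_digest = None
    for label, content in snapshots:
        digest = fingerprint(content)
        if previous_digest is not None and digest != previous_digest:
            changes.append(label)
        previous_digest = digest
    return changes


# Hypothetical captures of one page across three years.
captures = [
    ("2001", b"<h1>Welcome to our homepage</h1>"),
    ("2002", b"<h1>Welcome to our homepage</h1>"),
    ("2003", b"<h1>Welcome to our new homepage</h1>"),
]
changed = changed_snapshots(captures)
```

A hash comparison only shows that something changed, not what changed or why, so a reader would still need to view the two snapshots side by side to judge whether the content was doctored.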

Remembering the past

The AWA was created with the intention of giving users a better picture of specific subject matter, such as online coverage of Australian politics. While the archive does not contain as much information as social media platforms such as Twitter and Facebook, Wong said the AWA differentiates itself by deliberately limiting the amount of information it stores on a given subject. According to Wong, this strikes a balance between keeping users from getting distracted and ensuring they can still see the evolution of the internet and explore a subject extensively.

The archive also provides various search functions to make it easier for users to browse. These include Boolean search operators, as well as filters for domain, file type, date range, and whether or not a site is a government website.

With the release of Australia’s 2019 Federal Budget on Tuesday evening, which proposes to give NLA AU$10 million over the next four years to set up a Digitisation Fund, it looks like the organisation will continue to have the opportunity to find even better ways to document the important moments within Australia’s history and culture.

“The Digitisation Fund, which will also seek philanthropic contributions, will enable the continued digitisation of the NLA’s significant collection and expand its availability to all Australians through its online database, Trove,” the Budget documents said.

The National Archives of Australia will similarly guide Commonwealth agencies and departments to “promote and provide widespread access to the national archival collection through a national network of reading rooms, reference services, and education and public programs, taking advantage of the opportunities provided by known and emerging technology”.

With digital archives set to remain Australia’s medium for memory, it will be important for them to adapt as technology continues to move full steam ahead. Just as important, NLA director-general Marie-Louise Ayres said, is that organisations have the foresight and innovative thinking to capture both the now and the future.

“For those of us who lived and worked before the dawn of the website, it’s a fascinating reminder of how much things have changed. For those who’ve never known a world without the web, it’s a remarkable history lesson,” Ayres said.
