
URL-Filter for archive.org

Hi,

I have a request to whitelist the website http://archive.org. Since this website presents past versions of other websites' content, it is correctly categorized as "Anonymizing Utilities".

Now the question is whether it is possible to apply our URL Filter rule set to the websites it presents.

As far as I have seen, there are two indicators that could perhaps be used to identify the displayed site:

1. The URL path, which shows the archived website, e.g. https://web.archive.org/web/20160802092006/http://9gag.com/ (a site which should indeed be blocked in our company).

2. An HTTP header called "Referer", which also points to the site (https://web.archive.org/web/*/http://9gag.com).
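
To illustrate the first indicator, here is a minimal Python sketch of pulling the embedded site out of such a URL. It is not Web Gateway rule syntax. The path shape `/web/<timestamp or *>/<original URL>` is only inferred from the two example URLs above, and the function names `embedded_url` and `embedded_host` are made up for this illustration.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# Assumed path shape, inferred from the two example URLs in this post:
#   /web/<digits or *>/<original URL starting with http:// or https://>
_ARCHIVE_PATH = re.compile(r"^/web/(?:\d+|\*)/(?P<target>https?://.+)$", re.IGNORECASE)

# Hosts treated as archive front ends (assumption for this sketch).
_ARCHIVE_HOSTS = {"archive.org", "web.archive.org"}


def embedded_url(url: str) -> Optional[str]:
    """Return the original URL embedded in an archive URL, or None."""
    parts = urlsplit(url)
    if parts.hostname not in _ARCHIVE_HOSTS:
        return None
    match = _ARCHIVE_PATH.match(parts.path)
    if match is None:
        return None
    target = match.group("target")
    # A query string on the outer URL belongs to the embedded URL.
    if parts.query:
        target += "?" + parts.query
    return target


def embedded_host(url: str) -> Optional[str]:
    """Return only the host name of the embedded URL, or None."""
    target = embedded_url(url)
    return urlsplit(target).hostname if target else None


if __name__ == "__main__":
    print(embedded_host("https://web.archive.org/web/20160802092006/http://9gag.com/"))
    print(embedded_host("https://web.archive.org/web/*/http://9gag.com"))
```

The same extraction would apply to the value of the Referer header from the second indicator, since it has the same shape.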

Does anyone have an idea how to solve this?

5 Replies

Re: URL-Filter for archive.org

Maybe the easiest way is to overwrite the category for the site archive.org.

This can be done with a list of websites and their categories. This setting is available in the URL Filter settings.


Re: URL-Filter for archive.org

Hi Lubomir,

Thank you for your fast reply. The category overwrite is not exactly what I was looking for, because I cannot manually categorize every website in this site's database 🙂 The site shows the past of millions of stored websites. So I thought about an automatic URL category overwrite depending on the content of the site.

(At first I thought about an approach similar to the YouTube filter, but YouTube has its own API which returns what you need, so that approach does not carry over to this case.)

btlyric

Re: URL-Filter for archive.org

It should be possible to allow access to archive.org and still block the archive.org instances of sites that you would ordinarily block. You can also do this for other sites which have cached content.

Create a new URL Filter configuration with "Search for and rate embedded URLs" selected. For this example, I'll refer to it as Default with Embedded.

Create a new Category list that includes all of the categories that you would usually block, but do not include Anonymizing Utilities. For this example, I'll refer to it as Bad Category No Anonymizing Utilities.

Create a new list for the sites that will be handled this way. I'll refer to this one as Cached Content Sites. I used a wildcard list for future flexibility, but did not use wildcard/regex matching for the entries. Add archive.org and web.archive.org to the list.

In the same rule set and above your existing rule which blocks specific categories, add a new rule:

Criteria:

URL.Host matches in list Cached Content Sites AND

URL.Categories<Default with Embedded> none in list Bad Category No Anonymizing Utilities

Action:

Stop Rule Set.

With this rule in place, what should happen is that archive.org itself and site content hosted on archive.org is permitted, but any sites which would be blocked through the normal category blocking will still be blocked.
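
The intended flow can be sketched in Python as a rough model. This is not Web Gateway rule syntax or any product API. The category table `CATEGORY_DB`, the rating for `example.org`, the short block list, and the helper names are assumptions made for illustration; the table also assumes archive.org is rated Anonymizing Utilities, as stated in the original question.

```python
import re
from typing import Optional, Set
from urllib.parse import urlsplit

# "Cached Content Sites" list from the steps above.
CACHED_CONTENT_SITES = {"archive.org", "web.archive.org"}

# "Bad Category No Anonymizing Utilities": a shortened, illustrative list.
BAD_CATEGORIES_NO_ANON = {"Malicious Software", "Malicious Downloads"}
# The existing block rule additionally blocks Anonymizing Utilities.
BAD_CATEGORIES = BAD_CATEGORIES_NO_ANON | {"Anonymizing Utilities"}

# Hypothetical ratings standing in for the real URL Filter database.
CATEGORY_DB = {
    "archive.org": {"Anonymizing Utilities"},
    "web.archive.org": {"Anonymizing Utilities"},
    "repo.hackerzvoice.net": {"Malicious Software", "Malicious Downloads"},
    "example.org": {"Education/Reference"},
}


def embedded_host(url: str) -> Optional[str]:
    """Host of a URL embedded in the path of `url`, if there is one."""
    path = urlsplit(url).path
    match = re.search(r"https?://", path)
    if match is None:
        return None
    return urlsplit(path[match.start():]).hostname


def categories(url: str, rate_embedded: bool) -> Set[str]:
    """Categories of the host, plus those of an embedded URL if requested."""
    host = urlsplit(url).hostname or ""
    cats = set(CATEGORY_DB.get(host, set()))
    inner = embedded_host(url) if rate_embedded else None
    if inner:
        cats |= CATEGORY_DB.get(inner, set())
    return cats


def evaluate(url: str) -> str:
    host = urlsplit(url).hostname or ""
    # New rule: cached-content host AND no bad category (embedded URLs rated)
    # -> Stop Rule Set, so the block rule below is skipped.
    if host in CACHED_CONTENT_SITES and not (
        categories(url, rate_embedded=True) & BAD_CATEGORIES_NO_ANON
    ):
        return "allow"
    # Existing rule: block by category using the primary settings.
    if categories(url, rate_embedded=False) & BAD_CATEGORIES:
        return "block"
    return "allow"


if __name__ == "__main__":
    print(evaluate("http://web.archive.org/web/20150731000606/http://example.org/"))
    print(evaluate("http://web.archive.org/web/20150731000606/http://repo.hackerzvoice.net/"))
```

In this model the second URL is blocked only because the existing rule still sees web.archive.org as Anonymizing Utilities once the new rule declines to fire.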

For example, the URL http://repo.hackerzvoice.net/depot_madchat/reseau/anti-peer2peer-networks.txt is classified as Malicious Software, Malicious Downloads.

When the site is accessed via archive.org, the URL looks like this:

http://web.archive.org/web/20150731000606/http://repo.hackerzvoice.net/depot_madchat/reseau/anti-pee...

Because the URL Filter settings for this rule look at the embedded URL, the categorizations for that URL will also be considered. The rule will fire with action Stop Rule Set for sites that don't match any category in the Bad Category No Anonymizing Utilities list, and the original blocking rule will be skipped. If a site's categorization is in the new list, the rule won't fire, Stop Rule Set won't be applied, and the rest of that specific rule set will be considered.

One caveat associated with archive.org is that it is currently categorized as Education/Reference for web.archive.org and Internet Services for archive.org. So if you're not looking at embedded URLs in your primary URL Filter configuration, ALL of the content hosted on archive.org will be accessible, and the results from the configuration described here will not be what you would expect, unless a category overwrite is applied for "archive.org" to force the categorization back to Anonymizing Utilities.

Re: URL-Filter for archive.org

Hi,

Either my rule set doesn't work properly and I didn't understand your how-to, or the MWG doesn't work properly anymore.

Could you review my ruleset, please?

 

[Screenshots, including "List of categories which are not allowed"]

 

Thank you very much.

btlyric

Re: URL-Filter for archive.org

I looked at your rule set and, other than an issue with the order of criteria (URL.Host values should be considered before the URL Filter settings), I didn't see anything obviously incorrect.

Testing against a current rule set that should work returned results suggesting that something may be broken in the embedded URL processing mechanism.

http://web.archive.org/web/20150731000606/http://repo.hackerzvoice.net/depot_madchat/reseau/anti-pee... should have been blocked, but was not.

I can definitively state that the "archive.org/Content Cache" rule set as described in the previous post worked with version 7.5.2 as recently as 9/2016, but I do not immediately have data points between then and now.

 

 

 

 

 

 

 

 
