When I tried to use HTTrack to download a single website using the program's default settings (as of Nov. 2016), I got the website but also various random files from other domains, presumably reached via links on the main domain. In some cases, the number of links the program tried to download grew without limit, and I had to cancel. In order to download files only from the desired domain, I had to do the following.

Step 1: Specify the domain(s) to download (as I had already been doing).

Step 2: Add a Scan Rules pattern like +*example.com/* (with example.com replaced by the domain in question). This way, only links on that domain will be downloaded. Including a * before the main domain name is useful in case the site has subdomains. For example, if the site has a subdomain like files.example.com, it would be missed by a pattern without the leading *, such as +www.example.com/*.

Troubleshooting

Error: "Forbidden" (403)

Some pages gave me a "Forbidden" error, which prevented any content from being downloaded. I was able to fix this by clicking on "Set options...", choosing the "Browser ID" tab, and then changing "Browser 'Identity'" from the default of "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)" to "Java1.1.4". I chose the Java identity because it doesn't contain the substring "HTTrack", which may have been the reason I was being blocked.

On Mac, I download websites using SiteSucker. This page gives configuration details that I use when downloading certain sites.

I think website downloads using the above methods don't include the redirects that a site may be using. A redirect ensures that an old link doesn't break when you move a page to a new URL.
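HTTrack's scan rules are glob-like URL filters, so the effect of putting a * before the domain name can be sketched with Python's fnmatch (the URLs and example.com are placeholders, and HTTrack's matcher isn't exactly fnmatch, but the idea is the same):

```python
from fnmatch import fnmatch

# Hypothetical links a crawl might encounter; example.com stands in
# for whatever domain you are mirroring.
urls = [
    "http://www.example.com/page.html",
    "http://files.example.com/data.pdf",  # a subdomain of the site
    "http://other-site.org/ad.js",        # an off-domain link
]

broad = "*example.com/*"       # like the scan rule +*example.com/*
narrow = "*www.example.com/*"  # like +*www.example.com/*

for url in urls:
    print(url, fnmatch(url, broad), fnmatch(url, narrow))
```

The broad pattern accepts both the www host and the files subdomain while still rejecting the off-domain link; the narrow pattern misses the subdomain entirely, which is the failure mode described above.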
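The "Browser ID" fix works because the identity string is just the HTTP User-Agent header, which a server can inspect and reject. A minimal sketch with Python's urllib (the URL is a placeholder; no request is actually sent):

```python
import urllib.request

# Some servers return 403 Forbidden for user agents containing "HTTrack".
# Presenting a different identity, like the "Java1.1.4" option in
# HTTrack's Browser ID tab, just changes this one header.
req = urllib.request.Request(
    "http://example.com/",
    headers={"User-Agent": "Java1.1.4"},
)
print(req.get_header("User-agent"))  # → Java1.1.4
```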
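To make the redirect point concrete: a redirect is just an HTTP 301/302 response whose Location header names the page's new URL, and clients follow it transparently. A self-contained sketch using Python's standard-library HTTP server (the /old-page and /new-page paths are made up for illustration):

```python
import http.server
import threading
import urllib.request

class RedirectDemo(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/old-page":
            # The old URL answers with a 301 pointing at the new location,
            # so saved links to /old-page keep working after the move.
            self.send_response(301)
            self.send_header("Location", "/new-page")
            self.end_headers()
        else:
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"moved content")

    def log_message(self, *args):
        pass  # keep the demo's output quiet

server = http.server.HTTPServer(("127.0.0.1", 0), RedirectDemo)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# urllib follows the 301 automatically and fetches /new-page.
body = urllib.request.urlopen(f"http://127.0.0.1:{port}/old-page").read()
print(body)  # → b'moved content'
server.shutdown()
```

A static mirror made with HTTrack or SiteSucker captures only the pages themselves, not this server-side redirect table, which is why a rehosted copy can break old inbound links.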