What should I use to crawl/download/archive an entire website? It's all simple static pages (no JavaScript), but has lots of links to download small binary files, which I also want to preserve. Any OS -- just want the best tools.

@cancel The few times I've done this in the past, I used wget --mirror with a few tweaked parameters for directory traversal and domain spanning.
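Roughly along these lines (the exact extra flags are from memory and illustrative only; the URL is a placeholder):

wget --mirror --no-parent --convert-links --adjust-extension --page-requisites https://example.com/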

@blindcoder It only seems to download .html, images, css, etc.

@cancel It can only follow links found in HTML, naturally, but it will follow all hyperlinks regardless of the linked file's type.

@blindcoder No, it's not downloading .zip files that are linked from .html files.

@cancel @blindcoder Are the binary files hosted on the same domain as the HTML+images+CSS? I think wget needs explicit options to allow fetching from multiple domains in recursive mode, and it might also need options to limit the recursion depth in that case to avoid downloading the whole internet...
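Something in this direction, where the host list and depth are purely illustrative:

wget -r -l 5 -H -D example.com,files.example.com https://example.com/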

@mathr @blindcoder They're on the same domain. It looks like wget either requires the file extensions to be added to a list of accepted extensions, or that robots.txt be ignored. I can do the latter, but I'm not sure how to do the former, because there are many varied file extensions I want to back up. Is there a way to wildcard it?

@cancel @mathr Well, wget does respect robots.txt by default. Try with this: wget -e robots=off
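Putting it together, something along these lines should mirror the site including the linked binaries (URL is a placeholder; the extra flags are just the ones I'd reach for):

wget --mirror -e robots=off --no-parent --convert-links --page-requisites https://example.com/

As for the extension list: I believe -A/--accept treats entries containing wildcard characters as patterns (e.g. -A '*.zip,*.tar.*'), but with robots=off you may not need an accept list at all.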

@cancel @mathr I think wget also respects rel=nofollow but I don't know how to turn that off...
