A Basic Website Crawler, in Python, in 12 Lines of Code
Step 1 Lay out the logic.
OK, as far as crawlers (web spiders) go, this one cannot be more basic. Well, it can, if you remove lines 11-12, but then it's about as useful as a broken pencil - there's just no point. (Get it? Hehe...he...I'm a sad person...)
So what does a web crawler do? Well, it scours a page for URLs (in our case) and puts them in a neat list. But it does not stop there. Nooooo sir. It then iterates through each found URL, goes into it, and retrieves the URLs in that page. And so on (if you code it further - there's a sketch of that near the end).
What we are coding is a very scaled-down version of what makes Google its millions. Well, it used to be. Now it's 50% searches, 20% advertising, 10% user profile sales and 20% data theft. But hey, who's counting?
This has a LOT of potential, and should you wish to expand on it, I'd love to see what you come up with.
So let's plan the program.
The logic here is fairly straightforward:
- user enters the beginning URL
- crawler goes in and goes through the source code, gathering all URLs inside
- crawler then visits each URL in another for loop, gathering child URLs from the initial parent URLs
- profit???

Step 2 The Code:
#! C:\python27
import re, urllib
# file to store the second-level (child) URLs we find
textfile = file('depth_1.txt', 'wt')
print "Enter the URL you wish to crawl.."
print 'Usage - "http://phocks.org/stumble/creepy/" <-- With the double quotes'
myurl = input("@> ")
# find every href on the starting page...
for i in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(myurl).read(), re.I):
    print i
    # ...then visit each of those pages and grab their hrefs too
    for ee in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(i).read(), re.I):
        print ee
        textfile.write(ee + '\n')
textfile.close()
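Side note: that's Python 2 (print statements, urllib.urlopen, the old file() builtin). If you're running Python 3, a rough equivalent - same logic, consider it a sketch rather than gospel - looks like this:

import re
import urllib.request

textfile = open('depth_1.txt', 'w')
print("Enter the URL you wish to crawl..")
# Python 3's input() returns a plain string, so no quotes needed this time
myurl = input("@> ")
link_re = re.compile(r'''href=["'](.[^"']+)["']''', re.I)
# find every href on the starting page...
for i in link_re.findall(urllib.request.urlopen(myurl).read().decode('utf-8', 'ignore')):
    print(i)
    # ...then visit each child page and grab its hrefs too
    for ee in link_re.findall(urllib.request.urlopen(i).read().decode('utf-8', 'ignore')):
        print(ee)
        textfile.write(ee + '\n')
textfile.close()

One caveat carried over from the original: any relative link (href="/about" and friends) will make urlopen choke on the second pass, which is exactly the kind of rough edge you could expand on.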
That's it... No really.. That. Is. It.
So we create a file called depth_1.txt and prompt the user for a URL, which should be entered in the following format - "http://www.google.com/" - quotation marks included.
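Why the quotes? In Python 2, input() runs whatever you type through eval(), so a bare URL gets parsed as Python code and blows up; wrapped in quotes, it evaluates to a plain string. A quick demo at the Python 2 prompt (raw_input(), which skips the eval(), is the other way around this):

>>> eval('http://www.google.com/')      # what input() does with a bare URL
SyntaxError: invalid syntax
>>> eval('"http://www.google.com/"')    # quoted: just a string
'http://www.google.com/'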
Then we loop through the starting page, parse the source and pull out its URLs, visit each one to gather the child URLs, write those to the file, print everything to the screen, and close the file.
Done!
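Oh, and about that "and so on" from Step 1: if you want to crawl deeper than two levels, nested for loops stop scaling fast. The usual trick is a to-visit queue plus a visited set. A rough sketch (Python 3; crawl() and max_depth are my names for illustration, not anything standard):

import re
import urllib.request
from collections import deque

link_re = re.compile(r'''href=["'](.[^"']+)["']''', re.I)

def crawl(start_url, max_depth=2):
    # breadth-first: a queue of (url, depth) pairs plus a set of seen URLs
    visited = set()
    queue = deque([(start_url, 0)])
    found = []
    while queue:
        url, depth = queue.popleft()
        if url in visited or depth >= max_depth:
            continue
        visited.add(url)
        try:
            page = urllib.request.urlopen(url).read().decode('utf-8', 'ignore')
        except Exception:
            continue  # skip relative links, dead pages, anything that won't open
        for link in link_re.findall(page):
            found.append(link)
            queue.append((link, depth + 1))
    return found

print('\n'.join(crawl('http://phocks.org/stumble/creepy/', max_depth=3)))

With max_depth=2 this does the same two-level crawl as the script above; crank it up and it keeps going without any extra nesting.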
Finishing Statement
So, I hope this aids you in some way, and again, if you improve on it - please share it with us!
Regards
Mr.F