Search Engine Robots or Web Crawlers
Most users rely on search engines to find the information they need. But how do search engines provide this information, and where do they collect it from? Most search engines maintain their own database of information. This database covers the sites available on the web and holds detailed page data for each of them. Search engines do this background work by using robots to gather information and maintain the database. They build a catalog of the gathered information and then present it publicly or, at times, for private use.
In this article we discuss the entities that roam the web environment, that is, the web crawlers that move around the internet. We will learn:
· What they are and what purpose they serve.
· Pros and cons of using these entities.
· How we can keep our pages away from crawlers.
· Differences between the common crawlers and robots.
In the following portion we divide the discussion into the following two sections:
I. Search Engine Spider : Robots.txt.
II. Search Engine Robots : Meta-tags Explained.
I. Search Engine Spider : Robots.txt
What is the robots.txt file?
A web robot is a program, typically part of search engine software, that visits sites regularly and automatically and crawls through the web's hypertext structure by fetching a document and then recursively retrieving all the documents it references. Sometimes website owners do not want all their pages to be crawled by web robots. They can exclude some of their pages from crawling by using a standard mechanism. Most robots abide by the ‘Robots Exclusion Standard’, a set of constraints that restrict robot behavior.
The ‘Robots Exclusion Standard’ is a protocol used by the site administrator to control the movement of robots. When a search engine robot comes to a site, it looks for a file named robots.txt in the root of the domain (http://www.anydomain.com/robots.txt). This is a plain text file which implements the ‘Robots Exclusion Protocol’ by allowing or disallowing specific files and directories. The site administrator can disallow access to cgi, temporary or private directories by specifying robot user agent names.
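Because the file always sits at the root of the domain, its address can be derived from any page address on the site. The following is a minimal Python sketch of that step; the helper name robots_url is hypothetical and the example page address is only an illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; robots.txt lives at the root path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.anydomain.com/links/listing.html"))
```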
The format of the robots.txt file is very simple. Each record consists of two kinds of field: a User-agent field and one or more Disallow fields.
What is User-agent?
This is the technical name for a client program in the world wide web environment. In the robots.txt file it is used to name the specific search engine robot a record applies to.
For example:
User-agent: googlebot
We can also use the wildcard character “*” to specify all robots:
User-agent: *
This means that the record applies to all robots.
What is Disallow?
In the robots.txt file the second field is the Disallow field. These lines tell the robots which files should not be crawled. For instance, to prevent downloading of email.htm the syntax will be:
Disallow: /email.htm
To prevent crawling through a directory the syntax will be:
Disallow: /cgi-bin/
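To see how a robot might apply these two fields, here is a minimal sketch using Python's standard urllib.robotparser module. The robot name "anybot" and the page addresses are assumptions made for illustration, and the rules are parsed from a string rather than fetched from a live site.

```python
from urllib.robotparser import RobotFileParser

# Rules combining the two Disallow examples above for all robots.
RULES = """\
User-agent: *
Disallow: /email.htm
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(agent, url):
    """Return True if the parsed rules let this agent fetch the url."""
    return parser.can_fetch(agent, url)


print(allowed("anybot", "http://www.anydomain.com/email.htm"))
print(allowed("anybot", "http://www.anydomain.com/cgi-bin/search"))
print(allowed("anybot", "http://www.anydomain.com/index.html"))
```

Only the last address is allowed, because Disallow values are matched as prefixes of the requested path.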
White Space and Comments :
Any line in the robots.txt file that begins with # is treated as a comment only. A comment at the beginning of the robots.txt file, like the following example, tells us which site the file applies to.
# robots.txt for www.anydomain.com
Entry Details for robots.txt :
1) User-agent: *
Disallow:
The asterisk (*) in the User-agent field denotes “all robots”. As nothing is disallowed, all robots are free to crawl through the whole site.
2) User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/
Disallow: /private/
All robots are allowed to crawl through all files except those in the cgi-bin, temp and private directories.
3) User-agent: dangerbot
Disallow: /
Dangerbot is not allowed to crawl through any of the directories. “/” stands for all directories.
4) User-agent: dangerbot
Disallow: /

User-agent: *
Disallow: /temp/
The blank line indicates the start of a new User-agent record. Dangerbot is shut out completely, while all other bots are allowed to crawl through all directories except the “temp” directory.
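The effect of two records can be checked with Python's standard urllib.robotparser module. This is only a sketch: "otherbot" is a hypothetical robot name standing in for any robot other than dangerbot, and the page addresses are made up.

```python
from urllib.robotparser import RobotFileParser

# Entry 4: one record for dangerbot, one record for everyone else.
RULES = """\
User-agent: dangerbot
Disallow: /

User-agent: *
Disallow: /temp/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(agent, path):
    """Check a path on the example site against the parsed records."""
    return parser.can_fetch(agent, "http://www.anydomain.com" + path)


for agent in ("dangerbot", "otherbot"):
    for path in ("/index.html", "/temp/cache.html"):
        print(agent, path, allowed(agent, path))
```

Dangerbot is refused both paths, while otherbot is refused only the path under /temp/.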
5) User-agent: dangerbot
Disallow: /links/listing.html

User-agent: *
Disallow: /email.html
Dangerbot is not allowed to crawl the listing page of the links directory. All other robots are allowed in all directories but may not download the email.html page.
6) User-agent: abcbot
Disallow: /*.gif$
To exclude all files of a specific file type (e.g. .gif) we can use the above robots.txt entry.
7) User-agent: abcbot
Disallow: /*?
To stop a web crawler from crawling dynamic pages we can use the above robots.txt entry.
Note: The Disallow field may contain “*” to match any sequence of characters and may end with “$” to indicate the end of the name. These pattern characters are an extension to the original standard, so not every robot understands them.
E.g.: To exclude all gif files from Google's image crawler while permitting other image files:
User-agent: Googlebot-Image
Disallow: /*.gif$
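The pattern rules in the note can be sketched as a translation into a regular expression. This is an illustrative Python sketch, not the matching code of any real crawler; the function names rule_to_regex and is_blocked and the sample paths are hypothetical.

```python
import re


def rule_to_regex(pattern):
    """Translate a Disallow pattern with '*' and a trailing '$' into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' matches any run of characters; everything else is literal.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    # Without '$' the pattern only has to match the start of the path.
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(pattern, path):
    """Return True if the path falls under the Disallow pattern."""
    return rule_to_regex(pattern).match(path) is not None


print(is_blocked("/*.gif$", "/images/logo.gif"))
print(is_blocked("/*.gif$", "/images/logo.png"))
print(is_blocked("/*?", "/page.php?id=3"))
print(is_blocked("/*?", "/page.php"))
```

With these rules the gif file and the address containing a query string are blocked, while the png file and the plain page are not.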
Disadvantages of robots.txt :
Problem with Disallow field:
Disallow: /css/ /cgi-bin/ /images/
Different spiders will read the above field in different ways. Some will ignore the spaces and read /css//cgi-bin//images/, and some may consider only /images/ or /css/ and ignore the others.
The proper syntax should be :
Disallow: /css/
Disallow: /cgi-bin/
Disallow: /images/
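A mistake of this kind can be caught before the file is published. The following Python sketch flags Disallow lines that carry more than one path; the function name find_multi_path_lines is hypothetical and the check is deliberately simple.

```python
def find_multi_path_lines(robots_txt):
    """Return the line numbers of Disallow fields holding several paths."""
    problems = []
    for number, line in enumerate(robots_txt.splitlines(), start=1):
        # Drop any comment, then split the field name from its value.
        line = line.split("#", 1)[0].strip()
        field, separator, value = line.partition(":")
        if separator and field.strip().lower() == "disallow":
            if len(value.split()) > 1:
                problems.append(number)
    return problems


SAMPLE = """\
User-agent: *
Disallow: /css/ /cgi-bin/ /images/
Disallow: /temp/
"""

print(find_multi_path_lines(SAMPLE))
```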
All Files listing:
Listing each and every file name within a directory is the most common mistake:
Disallow: /ab/cdef.html
Disallow: /ab/ghij.html
Disallow: /ab/klmn.html
Disallow: /op/qrst.html
Disallow: /op/uvwx.html
The above can be written as:
Disallow: /ab/
Disallow: /op/
A trailing slash means that everything under that directory is off-limits.
Capitalization:
USER-AGENT: REDBOT
DISALLOW:
Although field names are not case sensitive, data such as directory names and file names are case sensitive.
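The difference can be observed with Python's standard urllib.robotparser module. In this sketch the directory /Private/ is an assumed path added for illustration, since the example above disallows nothing; other parsers may behave differently.

```python
from urllib.robotparser import RobotFileParser

# Field names in capitals, plus an assumed mixed-case directory.
RULES = """\
USER-AGENT: REDBOT
DISALLOW: /Private/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(agent, path):
    """Check a path on the example site against the parsed record."""
    return parser.can_fetch(agent, "http://www.anydomain.com" + path)


# The capitalized field names are still recognized.
print(allowed("redbot", "/Private/notes.html"))
# The path comparison is case sensitive, so this one is not matched.
print(allowed("redbot", "/private/notes.html"))
```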
Conflicting syntax:
User-agent: *
Disallow: /
#
User-agent: Redbot
Disallow:
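What will happen? Redbot is allowed to crawl everything, but will this permission override the Disallow field for all robots, or will the Disallow override the permission? A robot that follows the standard obeys the record that names it, so Redbot may crawl everything, but a file like this is easy to misread.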
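One way to see how a particular parser resolves this is to try it. The sketch below feeds the conflicting file to Python's standard urllib.robotparser module. It shows only that one parser's behavior, and "otherbot" is a hypothetical name for any robot other than Redbot.

```python
from urllib.robotparser import RobotFileParser

# The conflicting file from above, including the bare comment line.
RULES = """\
User-agent: *
Disallow: /
#
User-agent: Redbot
Disallow:
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(agent, path):
    """Check a path on the example site against the parsed records."""
    return parser.can_fetch(agent, "http://www.anydomain.com" + path)


print("Redbot:", allowed("Redbot", "/index.html"))
print("otherbot:", allowed("otherbot", "/index.html"))
```

This parser applies the record that names Redbot and falls back to the “*” record only for robots that are not named, so Redbot is allowed and otherbot is not.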
II. Search Engine Robots : Meta-tags Explained
What is the robots meta tag?
Besides robots.txt, search engines have another tool for controlling how web pages are crawled. This is the robots META tag, which tells a web spider whether to index a page and follow the links on it. It can be more useful in some cases because it can be used on a page-by-page basis. It is also helpful in case you do not have the permission needed to access the server's root directory to manage the robots.txt file.
We place this tag within the header portion of the HTML.
Format of the Robots Meta tag :
In the HTML document it is placed within the HEAD section.
<html>
<head>
<META NAME="robots" CONTENT="index,follow">
<META NAME="description" CONTENT="Welcome to…….">
<title>……………</title>
</head>
<body>
Robots Meta Tag options :
There are four options that can be used in the CONTENT portion of the robots meta tag: index, noindex, follow and nofollow.
The tag above allows search engine robots to index the page and follow all the links on it. If the site admin does not want a page to be indexed or its links to be followed, they can replace “index,follow” with “noindex,nofollow”.
According to requirements, the site admin can use the robots meta tag with the following options:
<META NAME="robots" CONTENT="index,follow"> Index this page, follow links from this page.
<META NAME="robots" CONTENT="noindex,follow"> Do not index this page but follow links from this page.
<META NAME="robots" CONTENT="index,nofollow"> Index this page but do not follow links from this page.
<META NAME="robots" CONTENT="noindex,nofollow"> Do not index this page, do not follow links from this page.
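A robot has to read these options out of the page before it can act on them. The following Python sketch does that with the standard html.parser module. The class and function names are hypothetical, and treating a page without a robots meta tag as “index,follow” is an assumption of this sketch.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the options of the robots META tag, if the page has one."""

    def __init__(self):
        super().__init__()
        self.directives = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            content = attributes.get("content", "")
            self.directives = {
                item.strip().lower() for item in content.split(",") if item.strip()
            }


def robots_directives(html):
    """Return the set of robots options found in the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    if parser.directives is None:
        # Assumption: no tag means the page may be indexed and followed.
        return {"index", "follow"}
    return parser.directives


PAGE = '<html><head><META NAME="robots" CONTENT="noindex,follow"></head><body></body></html>'
print(sorted(robots_directives(PAGE)))
```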