Some Knowledge

Data Mining and Screen Scrapers

Posted in blogging, computers, internet by someknowledge on April 2nd, 2008

The Internet is a vast sea of raw data to some people.  Data mining is the process of using a computer to sift through a large volume of material and find what a person is looking for.  A screen scraper is a program that pulls data out of web pages, which are laid out for people to read, so that the data can be reused somewhere else.  Both the process of data mining and tools like bots and screen scrapers are being used more and more as the quantity of information online increases.
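To give a rough idea of how little code a screen scraper needs, here is a minimal sketch in Python that uses only the standard library.  It reads a page's HTML and collects the title and every link on it.  The class name, the function name, and the sample page are all made up for illustration; a real scraper would download pages over the network and handle far messier HTML.

```python
from html.parser import HTMLParser


class LinkScraper(HTMLParser):
    """Collects the page title and every hyperlink in an HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scrape(html):
    """Return (title, links) for the given HTML text."""
    parser = LinkScraper()
    parser.feed(html)
    return parser.title.strip(), parser.links


# Hypothetical page, standing in for something fetched from the web.
SAMPLE_PAGE = """
<html><head><title>Example Blog</title></head>
<body>
<p>Read <a href="http://example.com/post-1">my first post</a>.</p>
<p>Contact: <a href="mailto:me@example.com">email me</a></p>
</body></html>
"""

if __name__ == "__main__":
    title, links = scrape(SAMPLE_PAGE)
    print(title)
    print(links)
```

Notice that the mailto link comes out along with everything else.  The scraper does not care what the data is; it just collects whatever the page exposes.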

How does this affect anyone with a blog?  The answer is that your public posts can be seen as a source of raw data by people looking to promote their own websites and businesses.  Something you write might be seen as good publicity for some product.  Your popular blog posts might be reposted on another site to attract traffic to another person’s operation.  Your email address might be lifted off a website and sold to spammers.  If you post personal information, it could be used to impersonate you and fraudulently apply for credit cards or other services.

The FBI and the CIA have been using data mining techniques for years to track criminals.  The identities of the 9/11 hijackers were supposedly found with data mining.  Any database that is accessible can be a source of information on people that may be of use to some organization.  Computers are very quick and precise at sorting out information.

With the wide open nature of the Internet, anyone can use commonly available software to find out just about anything they need to know.  If you have ever used a search engine you have mined data.  It is just not possible for a human being to sort through such a huge amount of information unaided.  If there is anything to be concerned about in your own interaction with the net, it is probably this: keep private any information you do not want publicized.  If you post something somewhere there is probably an application that can find it faster than you can remember where you put it.  Collecting data on purchasing habits and web usage is big business.

Since I started this particular blog I have seen more of my writing excerpted and posted to other websites than I have ever seen before.  WordPress is being mined for content.  I’m sure there are companies out there that do nothing but steal text from the net to post on their own sites.  Finding these links is as simple as scrolling through your spam comments.  I guess if what you say is important enough to steal you might be happy.  You might just as easily be annoyed at people posting duplicate content and attaching bogus authorship to your work.  As long as there is open access to your writing this will be a problem.
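If you want something more systematic than eyeballing spam comments, one possible approach is to measure how much of your post shows up word for word on another page.  The Python sketch below compares overlapping five-word runs (sometimes called shingles) between two texts.  This is my own illustration and not something WordPress provides; the function names and the run length of five are arbitrary assumptions.

```python
import re


def shingles(text, size=5):
    """Return the set of overlapping word runs of the given size."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def overlap(original, suspect, size=5):
    """Fraction of the original's word runs that also appear in the suspect text."""
    orig = shingles(original, size)
    if not orig:
        return 0.0  # original is shorter than one run, nothing to compare
    return len(orig & shingles(suspect, size)) / len(orig)


if __name__ == "__main__":
    # Hypothetical texts for illustration.
    my_post = "WordPress is being mined for content by sites that repost text"
    their_page = "Great article! " + my_post + ". Buy cheap watches now"
    print(overlap(my_post, their_page))
```

A score near 1.0 suggests the other page copied most of the post, while a score near 0.0 suggests it did not.  Short quotations will land somewhere in between, so the number is a hint to go look, not proof of theft.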

Interesting Game Development Site

Posted in computers, games by someknowledge on April 2nd, 2008

GameDev.net is an interesting site that has a lot of information about developing computer games.  There is reference material, articles about design, links to different games, news, and job postings in the computer games industry.

Under the resources tab for this site you can find a nice library of articles about almost every aspect of game design from AI to running your own game company.  There is some very interesting technical information here that is available for free.  If you are at all interested in going into the computer games industry or are just interested in how games work, this is a good site to check out.