rip2db

This language was built to navigate though web sites and extract usefull information into a database or files. Like most small languages the commands are limited but well suited to the task. If you have any questions or suggestions then please contact me through my home page.

Index


Installation

The language will happily run in Linux or Windows XP. You will need PHP to run the interpreter and you'll need either MySQL or Sybase if you want to use the inbuilt database functionality. After downloading rip2db and uncompressing it you'll have a new rip2db directory which will contain the rip2db interpreter, some example programs, a log file, an global file and a save directory where downloaded files will be placed. If you don't have PHP or MySQL then an easy way to install it is to use XAMPP.

UNIX / Linux

Download rip2db interpreter tgz file then ...

> tar xvzf rip2db.tgz
> cd rip2db
> ls
	example_1  example_2 example_3  example_4  example_5
	global rip2db  rip2db.log  save

> php rip2db example_1

Windows XP etc

Download rip2db interpreter zip file then use search to find out where your php.exe

> unzip rip2db.zip
> cd rip2db
> dir
	example_1  example_2  example_3  example_4  example_5
	global rip2db  rip2db.log  save

> c:\php5\php.exe rip2db example_1

Back to index


Loading a web page

LOAD www.bikesandkites.com/beetle.html

LOOP_URL link_url link_txt
        WRITE {link_url}
END_URL
Example 1 extracts all the URL links from a web page. The LOAD command simply loads the web page while the LOOP_URL command loops through all of the URLs (web links) on that page and puts the URL into the 1st variable (link_url) and any associated text into the 2nd variable (link_txt). The WRITE command then writes the value of the variable to the screen. Note that the curley brackets are necessary to get the actual value of the variable and not just write the text "link_url". The END_URL command just marks the end of the loop.

Commands and variable names are case insensitive but I prefer to use upper case for commands and lower case for variables. If a variable name ends in _url then rip2db automatically turns it's value into a proper URL and removes any unneeded text. If the variable ends in _txt then rip2db removes any tags and wasted blank space.

Could you alter the program to print the text associated with the link rather than the link?

Back to index


If conditions and saving files

LOAD http://www.bikesandkites.com/beetle.html

LOOP_URL link_url link_txt
	IF {link_url} IS_LOCAL 
	   AND {link_url} IS_IMAGE 
	   AND {link_url} LIKE 'Th.*jpg'

		WRITE 'Saving {link_txt}'
		SAVE {link_url}
	END_IF
END_URL
Example 2 loops through all the URLs on the page and saves all the thumbnail images on your computer in the save directory. The URLs are only saved if the URL is on the same site, it's an image file and the text of the URL matches "Th.*jpg" (the small thumbnail images on my site all start with "Th"). If these tests succeed then it will save the file onto your local machine. After running this program you should find 4 small images in your save directory.

The IF command can check for all the conditions below and the different conditions can be combined with AND and OR. NOT can be inserted after the initial string to inverse the test. Commands following the IF statement are only executed if the conditions are true. The end of the statements to be conditionaly run is marked with an END_IF command.

  IF     'a string'
  '{VARIABLE}'
  etc
  [NOT]   LIKE 'string'
  IS_IMAGE
  IS_AUDIO
  IS_VIDEO
  IS_LOCAL
  IS_UNDER_DIR
  AND ...
  OR

Could you alter the program to extract all the image files on a different web page?

Back to index


Nesting loops etc

LOAD http://www.kentfolkmp3.supanet.com/

LOOP_URL BAND_URL BAND_TXT
        IF {BAND_TXT} LIKE music

                LOAD {BAND_URL}
                WRITE '{BAND_TXT}'

                LOOP_URL LISTEN_URL LISTEN_TXT
                        IF {LISTEN_URL} IS_AUDIO
                                WRITE '   Saving {SEQ} {LISTEN_URL}'
                                SAVE {LISTEN_URL}
                        END_IF
                END_URL
        END_IF
END_URL
Example 3 goes to a folk music site and then skips from page to page pulling the mp3 music for each band onto your local machine. It demonstrates how loops and if statements can be nested. The SEQ variable is a special variable that starts at one and increments each time you look at it.

Could you alter the program so it just lists all the links that are not audio links?

Back to index


Navigating a whole website

LOAD http://www.bikesandkites.com

LOOP_URL RECURSE link_url link_txt
     IF {link_url} LIKE 'html$'
          WRITE '{PCT_DONE}  {link_url}'
     END_IF
END_URL
The first example just looked at the links on a single web page while the second example extended this to look at the web pages connected to the initial page. What if we want to look at all the pages on a web site? Example 4 goes through all the URLs on the site including the URLs on other pages. It will only visit pages that are on the same site but it will return all URLs listed on any page of this site. It will only return a URL once even if it appears on many different pages. This is a type of recursion.

The PCT_DONE variable is a system variable that shows what percentage of pages have been read so far. Remember that some sights may have hundreds, if not thousands of linked web pages, so do try and be carefull. There is a maximum number of pages that can be retrieved (MAX_NUM_LOADS) and this defaults to top the program swamping another web site by repeatidly reading page after page the program will add a delay between reading pages, this defaults to 10 seconds but will be overiden by the robots.txt on the host site.

Could you alter the program to also list all the images on each page?

Back to index


Writing to a database

DB MYSQL localhost db_name user password

SQL 'drop table if exists MyImgTab'
SQL 'create table MyImgTab( img_url  varchar(255) )'

LOAD http://www.bikesandkites.com

LOOP_URL RECURSE LINK_URL LINK_TXT
    IF {LINK_URL} IS_IMAGE
        WRITE '  saving image {LINK_URL}'
        SQL 'insert MyImgTab values ( "{LINK_URL}" )'
    END_IF
END_URL
Example 5 is traversing the whole of my site again but this time will we store all the image links into a table we've created on the database. You will need to alter the DB command to use your own database name, username and password. The program will create a table called MyImgTab and then later insert each new image URL. Any type of SQL can appear here but this would be a typical use of the program.

Obviously if you don't have a database to connect to then this will cause an error when it first tries to connect to the database that isn't there. If we want to conect to a Sybase database then alter the DB command from MYSQL to SYBASE. The sql in the drop table command will probably have to alter to just 'drop table MyImgTab'. Rather than having to put a DB command into every program you could just put it once into the global file - this will then get called before running any program.

Could you alter the program to use your own web site?

Back to index


Reading from a database

DB MYSQL localhost db_name user pwd

LOOP_SQL 'select img_url from MyImgTab'
     WRITE 'image stored was {img_url}'
END_SQL
Example 6 reads data directly from the database. Each field in the select must have a name and this name then becomes the name of a variable in your program. It's reading from the table we created in the previous example. Could you select a couple of fields out from a table of your own?

Back to index


Extracting data from HTML tables

LOAD http://www.exchangerate.com/

LOOP_TABLE 0
        WRITE '{COL_2} | {COL_4} | {COL_5}'
END_TABLE
Example 7 has a special loop called LOOP_TABLE that pulls all the entries from a web table, one row at a time. The column values are put into variables COL_1, COL_2 ... so the values can easily be used within your program.

Note the number after the LOOP_TABLE command - it is used to indicate which table on the web page to extract the data from. If you use 0 then the program will take data from the largest table on the page. If you run this using DEBUG then you will be shown how many records each table on the web page has. Just put the command DEBUG at the top of the program to go into debug mode.

Can you alter the program to extract this data into a database table?

Back to index


Other commands

/*
**	comments are done like this
*/
LOAD www.bikesandkites.com/beetle.html

DEBUG

SET var 'I like Apples'
REPLACE var 'apples' 'bananas'
WRITE {var}

RELOAD

LOOP beetle
     GET ' {word_txt} '
     WRITE {word_txt}
END_LOOP
This type of loop takes the input and splits it into sections using the string after the LOOP command. GET will then search through each section looking for a pattern (here a space). Then the variable will be populated with all the text between this first pattern and the next pattern (another space). So it will just pull the word that follows each occurrence of the word beetle. You can have multiple patterns and multiple variables encased in curly brackets in the string following the GET command. The RELOAD command will reload the contents of the last url accessed - in this particular instance it does nothing usefull but it will do this from cache so at least it will be quick.

Back to index


Commands

str Can be any string including variables enclosed in curly brackets which will be replaced by their values. Strings without spaces do not need quotes.
url Just like str but this time the end value will be treated as a URL. Relative URLs will be expanded into full URLs etc.
pat This is a regular expression that provides a pattern to search for. It is case insensitive.
var Is just the name of a variable and it is not enclosed in curly brackets. Variable names are always in upper case. Some variable names are predefined ie SEQ, QUEUE, COL_1, COL_2 ...

Back to index


Special variables

SEQ Starts off at 1 and then increments each time it is accessed.
QUEUE When using queues it holds the current item in the queue.
PCT_DONE When used with RECURSE it shows the percentage of the queue that has been processed.
DELAY_BETWEEN_LOADS Holds the delay in seconds before loading another page. This defaults to 10 seconds but will respect any value held in a web sites robots.txt file.
MAX_NUM_LOADS Holds the maximum number of pages that a program can load. It defaults to 300. This is usefull to stop programs from running forever and trying to process the whole of the internet.
CURRENT_FILE The name of the last file loaded.
CURRENT_BASE The base URL of the last file loaded.
VERSION The version of the rip2db interpreter.
NUM_LOADS How many loads have been performed so far.
PCT_DONE Shows what percentage of the queue has currently been processed.
DATE Shows today's date as YYMMDD.
TIME Shows the time as HH:MM
TITLE HTML title for current web page

Back to index


Files and directories

rip2db This is the interpreter which and it's written in PHP to call it just type php rip2db your_program
rip2db.log This holds the log for each run, if you use rip2db a lot then you might want to delete this file every now and then to stop it growing too big.
save Any files you pull of the web with the SAVE command are stored in this directory with the basename of the URL.
global This is a special file which, if present, is loaded before running any program. It's a good place to put your DB command so that you don't need to specify it in every program.

Back to index


Logs

The log file (rip2db.log) simply stores what has been loaded and any errors that have occurred. It will continually grow but you can delete it at any point.

080324 16:26 ===================================
080324 16:26 = Running program eg_4
080324 16:26 ===================================
080324 16:26 Loaded http://www.bikesandkites.com (length 2kb)
080324 16:26 Loaded http://www.bikesandkites.com/ppg_flight_log.html (length 82kb)
080324 16:26 Loaded http://www.bikesandkites.com/beetle.html (length 4kb)
080324 16:26 Loaded http://www.bikesandkites.com/bike.html (length 19kb)
080324 16:26 Loaded http://www.bikesandkites.com/cham.html (length 3kb)
080324 16:26 Loaded http://www.bikesandkites.com/kites.html (length 6kb)
080324 16:27 Loaded http://www.bikesandkites.com/int.html (length 2kb)
080324 16:27 Loaded http://www.bikesandkites.com/mame2.html (length 33kb)
080324 16:27 Loaded http://www.bikesandkites.com/pvr.html (length 15kb)
080324 16:27 Loaded http://www.bikesandkites.com/bc_binary_clock.html (length 10kb)
080324 16:27 Loaded http://www.bikesandkites.com/pixalator.html (length 5kb)
080324 16:28 Loaded http://www.bikesandkites.com/lcd_clock.html (length 8kb)
080324 16:28 Loaded http://www.bikesandkites.com/me.html (length 3kb)
080324 16:28 Loaded http://www.bikesandkites.com/personal.html (length 7kb)
080324 16:28 Loaded http://www.bikesandkites.com/index_frames.html (length 1kb)
080324 16:28 Loaded http://www.bikesandkites.com/ppg_faq.html (length 16kb)
080324 16:28 Loaded http://www.bikesandkites.com/ppg_extra_info.html (length 4kb)
080324 16:29 Loaded http://www.bikesandkites.com/index.html (length 2kb)
080324 16:29 Loaded http://www.bikesandkites.com/diamond.html (length 657b)
080324 16:29 Loaded http://www.bikesandkites.com/arcade.html (length 18kb)
080324 16:29 Loaded http://www.bikesandkites.com/lcd_ny_story.html (length 3kb)
080324 16:29 Loaded http://www.bikesandkites.com/scud.html (length 8kb)
080324 16:29 Loaded http://www.bikesandkites.com/danny_boy.html (length 2kb)
080324 16:30 Loaded http://www.bikesandkites.com/ppg_fwd_launch.html (length 5kb)
080324 16:30 Finished eg_4, loaded 24 files with total size 268kb

Hope you find the language useful but it's provided as is with no guarentees. Please feel free to send me an email with any issues or suggestions.

Back to index