This language was built to navigate through web sites and extract useful information into a database or files. Like most small languages, the commands are limited but well suited to the task. If you have any questions or suggestions then please contact me through my home page.
Download the rip2db interpreter tgz file then ...
> tar xvzf rip2db.tgz
> cd rip2db
> ls
example_1 example_2 example_3 example_4 example_5 global rip2db rip2db.log save
> php rip2db example_1
Download the rip2db interpreter zip file then use search to find out where your php.exe is
> unzip rip2db.zip
> cd rip2db
> dir
example_1 example_2 example_3 example_4 example_5 global rip2db rip2db.log save
> c:\php5\php.exe rip2db example_1
LOAD www.bikesandkites.com/beetle.html
LOOP_URL link_url link_txt
  WRITE {link_url}
END_URL
Commands and variable names are case insensitive, but I prefer to use upper case for commands and lower case for variables. If a variable name ends in _url then rip2db automatically turns its value into a proper URL and removes any unneeded text. If the variable name ends in _txt then rip2db removes any tags and wasted blank space.
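To make the _url and _txt clean-up more concrete, here is a rough Python sketch of the kind of tidying described above. It is not the interpreter's own code (rip2db is written in PHP). The function names `as_url` and `as_txt` are made up for the illustration, and treating a `#fragment` as "unneeded text" is only my guess at what gets removed.

```python
import re
from html import unescape
from urllib.parse import urljoin, urldefrag


def as_url(raw, base):
    """Turn a possibly relative link into a full URL (hypothetical helper)."""
    absolute = urljoin(base, raw.strip())
    # Assumption: a #fragment counts as unneeded text, so it is dropped.
    return urldefrag(absolute).url


def as_txt(raw):
    """Remove tags and squeeze runs of blank space (hypothetical helper)."""
    no_tags = re.sub(r"<[^>]*>", " ", raw)
    return re.sub(r"\s+", " ", unescape(no_tags)).strip()


base = "http://www.bikesandkites.com/beetle.html"
print(as_url("  pics/Th_front.jpg#top ", base))
print(as_txt("<b>My   Beetle</b>\n  <i>photos</i>"))
```

The first call resolves the relative link against the page it was found on. The second leaves only the visible words, separated by single spaces.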
Could you alter the program to print the text associated with the link rather than the link?
LOAD http://www.bikesandkites.com/beetle.html
LOOP_URL link_url link_txt
  IF {link_url} IS_LOCAL AND {link_url} IS_IMAGE AND {link_url} LIKE 'Th.*jpg'
    WRITE 'Saving {link_txt}'
    SAVE {link_url}
  END_IF
END_URL
The IF command can check for all the conditions below, and the different conditions can be combined with AND and OR. NOT can be inserted after the initial string to invert the test. Commands following the IF statement are only executed if the conditions are true. The end of the statements to be conditionally run is marked with an END_IF command.
IF | 'a string' '{VARIABLE}' etc | [NOT] | LIKE 'string' IS_IMAGE IS_AUDIO IS_VIDEO IS_LOCAL IS_UNDER_DIR | AND ... OR
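The tests in the example above (IS_LOCAL, IS_IMAGE and LIKE) can be pictured with a small Python sketch. This is only an illustration, not how rip2db itself decides. The list of image extensions and the rule that "local" means "same host as the loaded page" are my assumptions, and the helper names are invented. The LIKE pattern is treated as a case insensitive regular expression, as described for patterns later on this page.

```python
import re
from urllib.parse import urlparse

# Assumed list of image extensions, for illustration only.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".gif", ".png")


def like(value, pattern):
    """LIKE: the regular expression may match anywhere, ignoring case."""
    return re.search(pattern, value, re.IGNORECASE) is not None


def is_image(url):
    """IS_IMAGE: guessed here from the file extension in the URL path."""
    return urlparse(url).path.lower().endswith(IMAGE_EXTENSIONS)


def is_local(url, page_url):
    """IS_LOCAL: assumed to mean the same host as the loaded page."""
    return urlparse(url).netloc == urlparse(page_url).netloc


page = "http://www.bikesandkites.com/beetle.html"
link = "http://www.bikesandkites.com/pics/th_engine.JPG"

# IF {link} IS_LOCAL AND {link} IS_IMAGE AND {link} LIKE 'Th.*jpg'
print(is_local(link, page) and is_image(link) and like(link, "Th.*jpg"))

# IF {link} NOT LIKE 'html$'
print(not like(link, "html$"))
```

AND, OR and NOT map straight onto the Python operators of the same name, which is why the two IF lines in the comments translate so directly.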
Could you alter the program to extract all the image files on a different web page?
LOAD http://www.kentfolkmp3.supanet.com/
LOOP_URL BAND_URL BAND_TXT
  IF {BAND_TXT} LIKE music
    LOAD {BAND_URL}
    WRITE '{BAND_TXT}'
    LOOP_URL LISTEN_URL LISTEN_TXT
      IF {LISTEN_URL} IS_AUDIO
        WRITE ' Saving {SEQ} {LISTEN_URL}'
        SAVE {LISTEN_URL}
      END_IF
    END_URL
  END_IF
END_URL
Could you alter the program so it just lists all the links that are not audio links?
LOAD http://www.bikesandkites.com
LOOP_URL RECURSE link_url link_txt
  IF {link_url} LIKE 'html$'
    WRITE '{PCT_DONE} {link_url}'
  END_IF
END_URL
The PCT_DONE variable is a system variable that shows what percentage of pages have been read so far. Remember that some sites may have hundreds, if not thousands, of linked web pages, so do try to be careful. There is a maximum number of pages that can be retrieved (MAX_NUM_LOADS) and this defaults to 300. To stop the program swamping another web site by repeatedly reading page after page, the program adds a delay between reading pages; this defaults to 10 seconds but will be overridden by the robots.txt on the host site.
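The page limit and the delay between loads can be sketched in Python as below. This is my own illustration of the idea, not rip2db's PHP code. It assumes the delay comes from a Crawl-delay line in robots.txt, and the user agent name `rip2db` and the function names are invented. No pages are actually fetched, and the sleep function is passed in so the demonstration runs instantly.

```python
from urllib.robotparser import RobotFileParser

DEFAULT_DELAY = 10    # seconds, the default described above
MAX_NUM_LOADS = 300   # the default described above


def delay_for(robots_lines, agent="rip2db"):
    """Return the Crawl-delay from robots.txt lines, else the default."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    delay = parser.crawl_delay(agent)
    return DEFAULT_DELAY if delay is None else delay


def crawl(queue, delay, sleep, max_loads=MAX_NUM_LOADS):
    """Visit URLs in order, pausing between loads and stopping at the cap."""
    loaded = []
    for url in queue:
        if len(loaded) >= max_loads:
            break
        if loaded:
            sleep(delay)      # pause before every load except the first
        loaded.append(url)    # a real crawler would fetch the page here
    return loaded


robots = ["User-agent: *", "Crawl-delay: 30", "Disallow: /private/"]
pauses = []
pages = ["http://www.example.com/page%d.html" % n for n in range(305)]
done = crawl(pages, delay_for(robots), pauses.append)
print(len(done), len(pauses), delay_for(robots), delay_for([]))
```

In real use you would pass `time.sleep` as the `sleep` argument. Here the pauses are just collected in a list so you can see that 305 queued pages are cut off at the limit.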
Could you alter the program to also list all the images on each page?
DB MYSQL localhost db_name user password
SQL 'drop table if exists MyImgTab'
SQL 'create table MyImgTab( img_url varchar(255) )'
LOAD http://www.bikesandkites.com
LOOP_URL RECURSE LINK_URL LINK_TXT
  IF {LINK_URL} IS_IMAGE
    WRITE ' saving image {LINK_URL}'
    SQL 'insert MyImgTab values ( "{LINK_URL}" )'
  END_IF
END_URL
Obviously, if you don't have a database to connect to then this will cause an error when the program first tries to connect. If you want to connect to a Sybase database then change the DB command from MYSQL to SYBASE. The SQL in the drop table command will probably have to be changed to just 'drop table MyImgTab'. Rather than having to put a DB command into every program, you could put it once into the global file - this will then get called before running any program.
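If you have no database server to hand, the same drop, create, insert and select steps can be tried out with a small Python sketch. It uses an in-memory SQLite database as a stand-in for MySQL, which is my substitution and not something rip2db supports as far as this page says. The function names and the sample URLs are made up. Note that SQLite wants `insert into` where the MySQL example writes `insert`.

```python
import sqlite3


def store_images(conn, urls):
    """Recreate MyImgTab and add one row per image URL."""
    conn.execute("drop table if exists MyImgTab")
    conn.execute("create table MyImgTab( img_url varchar(255) )")
    # Placeholders keep odd characters in a URL from breaking the SQL.
    conn.executemany("insert into MyImgTab values (?)", [(u,) for u in urls])
    conn.commit()


def stored_images(conn):
    """Read the URLs back, like the LOOP_SQL example does."""
    return [row[0] for row in conn.execute("select img_url from MyImgTab")]


conn = sqlite3.connect(":memory:")   # stand-in for the MySQL server
store_images(conn, [
    "http://www.example.com/pics/one.jpg",     # made-up sample data
    "http://www.example.com/pics/two.gif",
])
for img_url in stored_images(conn):
    print("image stored was", img_url)
```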
Could you alter the program to use your own web site?
DB MYSQL localhost db_name user pwd
LOOP_SQL 'select img_url from MyImgTab'
  WRITE 'image stored was {img_url}'
END_SQL
LOAD http://www.exchangerate.com/
LOOP_TABLE 0
  WRITE '{COL_2} | {COL_4} | {COL_5}'
END_TABLE
Note the number after the LOOP_TABLE command - it is used to indicate which table on the web page to extract the data from. If you use 0 then the program will take data from the largest table on the page. If you run this using DEBUG then you will be shown how many records each table on the web page has. Just put the command DEBUG at the top of the program to go into debug mode.
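The "0 means the largest table" rule can be pictured with the Python sketch below. It is only an illustration and not the interpreter's code. I am assuming that tables are numbered from 1 in page order and that "largest" means "most rows". The class and function names are invented.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Count the rows of every table on a page, in page order."""

    def __init__(self):
        super().__init__()
        self.row_counts = []    # one entry per table
        self.open_tables = []   # indexes of tables not yet closed

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.row_counts.append(0)
            self.open_tables.append(len(self.row_counts) - 1)
        elif tag == "tr" and self.open_tables:
            # A row belongs to the innermost table that is still open.
            self.row_counts[self.open_tables[-1]] += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.open_tables:
            self.open_tables.pop()


def pick_table(html, number):
    """Return (table number to use, row counts); 0 picks the largest."""
    counter = TableCounter()
    counter.feed(html)
    counter.close()
    counts = counter.row_counts
    if number == 0 and counts:
        number = counts.index(max(counts)) + 1   # assumed 1-based numbering
    return number, counts


page = (
    "<table><tr><td>menu</td></tr></table>"
    "<table><tr><td>USD</td></tr><tr><td>EUR</td></tr><tr><td>GBP</td></tr></table>"
)
print(pick_table(page, 0))
```

The list of row counts is the sort of information DEBUG mode is described as showing, which is handy when the largest table is not the one you are after.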
Can you alter the program to extract this data into a database table?
/*
** comments are done like this
*/
LOAD www.bikesandkites.com/beetle.html
DEBUG
SET var 'I like Apples'
REPLACE var 'apples' 'bananas'
WRITE {var}
RELOAD
LOOP beetle
  GET ' {word_txt} '
  WRITE {word_txt}
END_LOOP
str | Can be any string including variables enclosed in curly brackets which will be replaced by their values. Strings without spaces do not need quotes. |
url | Just like str but this time the end value will be treated as a URL. Relative URLs will be expanded into full URLs etc. |
pat | This is a regular expression that provides a pattern to search for. It is case insensitive. |
var | Is just the name of a variable and it is not enclosed in curly brackets. Variable names are always in upper case. Some variable names are predefined, e.g. SEQ, QUEUE, COL_1, COL_2 ... |
SEQ | Starts off at 1 and then increments each time it is accessed. |
QUEUE | When using queues it holds the current item in the queue. |
PCT_DONE | When used with RECURSE it shows the percentage of the queue that has been processed. |
DELAY_BETWEEN_LOADS | Holds the delay in seconds before loading another page. This defaults to 10 seconds but will respect any value held in a web site's robots.txt file. |
MAX_NUM_LOADS | Holds the maximum number of pages that a program can load. It defaults to 300. This is useful to stop programs from running forever and trying to process the whole of the internet. |
CURRENT_FILE | The name of the last file loaded. |
CURRENT_BASE | The base URL of the last file loaded. |
VERSION | The version of the rip2db interpreter. |
NUM_LOADS | How many loads have been performed so far. |
DATE | Shows today's date as YYMMDD. |
TIME | Shows the time as HH:MM |
TITLE | HTML title for current web page |
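The way curly-bracket variables are filled in, and how a few of the system variables behave, can be sketched in Python as follows. This is my own illustration rather than the interpreter's code. The class name `Variables` is invented, and leaving an unknown `{name}` untouched is an assumption. SEQ counts up each time it is read, and DATE and TIME use the YYMMDD and HH:MM layouts from the table above.

```python
import re
from datetime import datetime


class Variables:
    """Hypothetical store for user variables plus SEQ, DATE and TIME."""

    def __init__(self, now=None):
        self.values = {}
        self.seq = 1
        self.now = now or datetime.now()

    def set(self, name, value):
        # Names are case insensitive, so keep them in upper case.
        self.values[name.upper()] = str(value)

    def get(self, name):
        name = name.upper()
        if name == "SEQ":
            current = self.seq
            self.seq += 1            # increments each time it is accessed
            return str(current)
        if name == "DATE":
            return self.now.strftime("%y%m%d")
        if name == "TIME":
            return self.now.strftime("%H:%M")
        return self.values.get(name)

    def expand(self, text):
        """Replace every {name} in text with the variable's value."""
        def replace(match):
            value = self.get(match.group(1))
            # Assumption: unknown names are left exactly as written.
            return match.group(0) if value is None else value
        return re.sub(r"\{(\w+)\}", replace, text)


v = Variables(now=datetime(2008, 3, 24, 16, 26))
v.set("listen_url", "tune.mp3")
print(v.expand("Saving {SEQ} {LISTEN_URL}"))
print(v.expand("Saving {seq} {listen_url}"))
print(v.expand("{DATE} {TIME} {unknown}"))
```

The fixed date passed to `Variables` is only there to make the output repeatable; leave it out to use the current time.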
rip2db | This is the interpreter, which is written in PHP. To call it just type: php rip2db your_program |
rip2db.log | This holds the log for each run. If you use rip2db a lot then you might want to delete this file every now and then to stop it growing too big. |
save | Any files you pull off the web with the SAVE command are stored in this directory with the basename of the URL. |
global | This is a special file which, if present, is loaded before running any program. It's a good place to put your DB command so that you don't need to specify it in every program. |
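To show what "the basename of the URL" means for files in the save directory, here is a short Python sketch. It is an illustration only. The function name `save_path` is invented, and the fallback name used when a URL has no file part is purely my assumption, since this page does not say what rip2db does in that case.

```python
import posixpath
from urllib.parse import unquote, urlparse


def save_path(url, save_dir="save"):
    """Work out where a saved file would land (hypothetical helper)."""
    # Only the path counts: the host and any ?query part are ignored.
    name = posixpath.basename(unquote(urlparse(url).path))
    if not name:
        name = "index.html"   # assumed fallback, not taken from rip2db
    return posixpath.join(save_dir, name)


print(save_path("http://www.bikesandkites.com/pics/Th_front.jpg?size=big"))
print(save_path("http://www.bikesandkites.com/"))
```

One thing this makes obvious is that two URLs with the same file name in different directories end up at the same path, so a later SAVE could overwrite an earlier one if names clash.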
080324 16:26 ===================================
080324 16:26 = Running program eg_4
080324 16:26 ===================================
080324 16:26 Loaded http://www.bikesandkites.com (length 2kb)
080324 16:26 Loaded http://www.bikesandkites.com/ppg_flight_log.html (length 82kb)
080324 16:26 Loaded http://www.bikesandkites.com/beetle.html (length 4kb)
080324 16:26 Loaded http://www.bikesandkites.com/bike.html (length 19kb)
080324 16:26 Loaded http://www.bikesandkites.com/cham.html (length 3kb)
080324 16:26 Loaded http://www.bikesandkites.com/kites.html (length 6kb)
080324 16:27 Loaded http://www.bikesandkites.com/int.html (length 2kb)
080324 16:27 Loaded http://www.bikesandkites.com/mame2.html (length 33kb)
080324 16:27 Loaded http://www.bikesandkites.com/pvr.html (length 15kb)
080324 16:27 Loaded http://www.bikesandkites.com/bc_binary_clock.html (length 10kb)
080324 16:27 Loaded http://www.bikesandkites.com/pixalator.html (length 5kb)
080324 16:28 Loaded http://www.bikesandkites.com/lcd_clock.html (length 8kb)
080324 16:28 Loaded http://www.bikesandkites.com/me.html (length 3kb)
080324 16:28 Loaded http://www.bikesandkites.com/personal.html (length 7kb)
080324 16:28 Loaded http://www.bikesandkites.com/index_frames.html (length 1kb)
080324 16:28 Loaded http://www.bikesandkites.com/ppg_faq.html (length 16kb)
080324 16:28 Loaded http://www.bikesandkites.com/ppg_extra_info.html (length 4kb)
080324 16:29 Loaded http://www.bikesandkites.com/index.html (length 2kb)
080324 16:29 Loaded http://www.bikesandkites.com/diamond.html (length 657b)
080324 16:29 Loaded http://www.bikesandkites.com/arcade.html (length 18kb)
080324 16:29 Loaded http://www.bikesandkites.com/lcd_ny_story.html (length 3kb)
080324 16:29 Loaded http://www.bikesandkites.com/scud.html (length 8kb)
080324 16:29 Loaded http://www.bikesandkites.com/danny_boy.html (length 2kb)
080324 16:30 Loaded http://www.bikesandkites.com/ppg_fwd_launch.html (length 5kb)
080324 16:30 Finished eg_4, loaded 24 files with total size 268kb
I hope you find the language useful, but it's provided as is with no guarantees. Please feel free to send me an email with any issues or suggestions.