Text-only view of web pages

I have added another tool to my site: View Text-only version of a web page!

Please note: it may take a couple of minutes until a request is processed! I’ll come back to the reason later.

It is very handy if you want to see the semantic structure of a page, or get a rough idea of how search engines, for example, would see it. Personally I think a page should look perfectly structured without scripts, CSS, etc., and only then should the display be modified. Accessibility and SEO are probably the two most important reasons for this.

Previously I used the text-only version in Google Cache, but nowadays Google manual penalties seem to be a bit more frequent, resulting in total de-indexing of sites, and for those sites my previous method didn’t work. So I created this little tool. It is nothing fancy, actually pretty much a test version (look at the result-page URL), and I’d say it is even a bit slow, but it does the job, so it’s available.
The source code: hmmm, not for the moment. If you want it, contact me, no problem, but it is just not tidy enough to be shared publicly, so I’m still working on it as and when I have time.

In a nutshell:

  1. You can set the basic stream context variables on the page (referer, user-agent); the HTTP protocol is set to version 1.1
  2. It loads the HTML as a DOMDocument
  3. It strips the following tags: link, style, img, script, iframe, input. I guess I should have also stripped frames, but they are not supported in HTML5 anyway, so it is foolish to use them.
  4. It returns the modified HTML document
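
The stripping steps above can be sketched roughly as follows. This is not the tool's actual source (which is not published), only a minimal sketch assuming PHP's DOM extension is available; the function name text_only_html() is made up for the illustration, and fetching the page (for example with file_get_contents() and a stream context) is left out.

```php
<?php
// Hypothetical helper (not the tool's real code): removes presentation and
// script-related tags from an HTML string and returns the remaining markup.
function text_only_html(string $html): string
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so collect libxml warnings quietly.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $strip = ['link', 'style', 'img', 'script', 'iframe', 'input'];
    foreach ($strip as $tag) {
        // getElementsByTagName() returns a live list, so copy it to a plain
        // array before removing nodes from the document.
        $nodes = iterator_to_array($doc->getElementsByTagName($tag));
        foreach ($nodes as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    return $doc->saveHTML();
}
```

Calling text_only_html() on a downloaded page should give back the same document with those six tag types gone and everything else left where it was.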

So, nothing fancy, but it does the job. If you need any other features or improvements, contact me, and as soon as I have reasonably tidy and maybe slightly quicker source code I’ll share it here!

All the best,

Balazs

HTTP Header Response and Source Viewer

Finally! I never thought it would take me so much time to create this simple tool, but anyway, it’s here. It lets you view full HTTP response headers, follows redirects up to 20 hops and gives you the source code of each location as well. Balazs’s HTTP Viewer

Instructions:

  1. URL: has to be the full URL you wish to check, starting with http:// or https://.
  2. User agent: you can enter anything; if left blank, your browser’s user agent is used. If you wish, you can find a fairly complete list of user agents here: http://www.user-agents.org/
  3. Referer: not misspelt, in case you wondered (that is how the HTTP header is spelled). If left blank, gorlestonit.uk is used, but you can set anything if you wish to test server responses based on referers.

HTTP protocol version is set to 1.1. I see no point in older ones, but if you do, let me know and I will consider adding it as an option.
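
To show how the three form fields and the fixed protocol version end up in a request, here is a minimal sketch of the options array that stream_context_create() expects. The helper name build_http_options() is invented for this example; the 20-redirect limit is PHP's default for the http wrapper and is only spelled out to make it visible.

```php
<?php
// Hypothetical helper: builds the options array for stream_context_create()
// from the user agent and referer entered in the form.
function build_http_options(string $userAgent, string $referer): array
{
    // Drop CR/LF so a form value cannot smuggle extra header lines into the request.
    $clean = function (string $value): string {
        return trim(str_replace(["\r", "\n"], '', $value));
    };

    return [
        'http' => [
            'method'           => 'GET',
            'protocol_version' => 1.1,
            'user_agent'       => $clean($userAgent),
            'header'           => 'Referer: ' . $clean($referer),
            'max_redirects'    => 20, // PHP's default, stated explicitly
        ],
    ];
}

// Usage (assumed variable names):
// $context = stream_context_create(build_http_options($ua, $referer));
```

Removing line breaks is the important part of the design: header values come straight from a public form, and a stray line break would otherwise start a new header line.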

So: test it, use it, and if you have any suggestions for what you would like to see in this tool, I’ll consider them. And if you find any bugs, please let me know ASAP.

Error messages: they are now reasonably friendly…

Personally I needed this tool to quickly view redirects on sites that seem to have problems in search engines, or on ones that are marked “dangerous” and that I wouldn’t want to load in my browser.
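
Since the whole point is to look at each redirect separately, it helps to know that get_headers() called without its second argument returns one flat list of lines covering every response in the chain. The sketch below (the function name group_headers_by_hop() is made up here) splits such a list into one group per hop. It works on a plain array, so it can be tried without any network access.

```php
<?php
// Hypothetical helper: splits the flat list returned by get_headers($url)
// into one entry per response, so each redirect hop can be shown on its own.
function group_headers_by_hop(array $lines): array
{
    $hops = [];
    foreach ($lines as $line) {
        // A status line such as "HTTP/1.1 301 Moved Permanently" starts a new hop.
        if (preg_match('#^HTTP/\d+(\.\d+)?\s+(\d{3})#', $line, $m)) {
            $hops[] = ['status' => (int) $m[2], 'location' => null, 'headers' => []];
            continue;
        }
        if (!$hops) {
            continue; // ignore anything before the first status line
        }
        $last = count($hops) - 1;
        $hops[$last]['headers'][] = $line;
        // Header names are case-insensitive, so match "Location" loosely.
        if (stripos($line, 'Location:') === 0) {
            $hops[$last]['location'] = trim(substr($line, strlen('Location:')));
        }
    }
    return $hops;
}
```

Each entry then holds the status code, the redirect target (if any) and the remaining header lines of one response.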

Source code? Let me quote from the father of HTTP viewers (well, the first HTTP viewer I used): it is good as it is, and indicates the level of consultancy I provide (please consider: mine is only version 0.1…), but the source code….

Is also free :-), with one condition: you give credit to me for it. It is all dead-simple PHP. Some may say there are “bad practices” in it, but in most cases there is a reason why it’s done the way it’s done, and it is meant to be safe and secure.

If you want it, here it is. I’m using two files, but you can combine them into one if you wish. It also contains the full reCaptcha integration, which is fairly useful for spam protection (see my previous post for the simplest integration):


<script src='https://www.google.com/recaptcha/api.js'></script>
<?php
// define variables and set to empty values
$urlErr = $captchaErr = "";
$url = $user_agent = $referer = "";
// set Site and Secret Key we got from the ReCaptcha API - the secret key only appears in the PHP source, not in the HTML output, so it's OK.
$secret = "your_secret_key";
$sitekey = "your_public_key";

// Creating a function to strip and sanitize all inputs
function test_input($data) {
    $data = trim($data);
    $data = stripslashes($data);
    // remove line breaks, so a value cannot add extra lines to the request headers
    $data = str_replace(array("\r", "\n"), "", $data);
    $data = htmlspecialchars($data);
    return $data;
}

// Only talk to the reCaptcha server when the form has actually been posted
if ($_SERVER["REQUEST_METHOD"] == "POST") {
    // Now reCaptcha - server side. Has to be on this page, since we post the form to this page.
    $postdata = http_build_query(
        array(
            'secret' => $secret,
            'response' => isset($_POST["g-recaptcha-response"]) ? $_POST["g-recaptcha-response"] : ""
        )
    );

    $opts1 = array('http' =>
        array(
            'method' => 'POST',
            'header' => 'Content-type: application/x-www-form-urlencoded',
            'content' => $postdata
        )
    );

    $context1 = stream_context_create($opts1);

    // We pack the g-recaptcha-response and send it with our secret key to the reCaptcha server.
    // $result is a JSON string (or false if the request failed), which we need to decode.
    $result = file_get_contents('https://www.google.com/recaptcha/api/siteverify', false, $context1);

    // true makes json_decode return an associative array; "success" is only true if the captcha was solved
    $decoded_result = json_decode((string) $result, true);
    $captcha_response = !empty($decoded_result["success"]);

    if (!$captcha_response) {
        $captchaErr = "Try again - You might be a ROBOT! Or try it again...";
    }

    if (empty($_POST["url"])) {
        $urlErr = "Please enter URL!";
    } else {
        $url = filter_var($_POST["url"], FILTER_SANITIZE_URL);
        // check that it is a well-formed URL starting with http:// or https://
        if (filter_var($url, FILTER_VALIDATE_URL) === false or !preg_match('#^https?://#i', $url)) {
            $urlErr = "This is not a valid url - try again!";
        }
    }

    if (empty($_POST["user_agent"])) {
        // fall back to the visitor's own browser, sanitized like any other input
        $user_agent = test_input(isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : "");
    } else {
        $user_agent = test_input($_POST["user_agent"]);
    }

    if (empty($_POST["referer"])) {
        $referer = "http://yourdefaultreferer.com";
    } else {
        $referer = test_input($_POST["referer"]);
    }
}

?>

<form method="post" action="<?php echo htmlspecialchars($_SERVER["REQUEST_URI"]);?>">
URL: <input class="form" type="text" name="url" value="<?php echo htmlspecialchars($url);?>">
<span class="error"><?php echo $urlErr;?></span>
<br /><br />
User Agent: <input class="form" type="text" name="user_agent" value="<?php echo $user_agent;?>">
<br /><br />
Referer: <input class="form" type="text" name="referer" value="<?php echo $referer;?>">
<br /><br /><br />
<div class="g-recaptcha" data-sitekey="<?php echo $sitekey ?>"></div> <span class="error"><?php echo $captchaErr ?></span>
<br />
<input type="submit" name="submit" value="Submit">
</form>

<?php

if ($_SERVER["REQUEST_METHOD"] == "POST" and $urlErr == "" and $captchaErr == "") {
    // all input is valid: process the request
    include 'include/http_results.php';
} elseif ($_SERVER["REQUEST_METHOD"] == "POST") {
    echo "<br /><span class=\"error\">You must try this again... </span>";
}
?>

Now if you don’t wish to use reCaptcha, just remove the g-recaptcha div from the form, the script at the very beginning of the file (normally you should put it in the HTML head), the HTTP POST request that verifies the response, and the check that sets $captchaErr.

Now just a side note: all inputs are validated and sanitized; the aim is to leave no room for cross-site scripting attacks.
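
One detail worth illustrating: sanitizing on input is only half of the story, because as far as I know FILTER_SANITIZE_URL leaves quotes and angle brackets in place. Escaping at the moment a value is echoed into the page covers that case. A minimal sketch, with a made-up helper name e():

```php
<?php
// Hypothetical helper: escapes a value for use inside HTML text or a quoted
// attribute. ENT_QUOTES makes sure both double and single quotes are encoded.
function e(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

// Usage inside a template (assumed variable name):
// <input type="text" name="url" value="<?php echo e($url); ?>">
```

With this in place a value such as "><script>alert(1)</script> is shown back to the user as text instead of being run by the browser.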

And as you can see close to the end of the file, if all is good (i.e. all input and the captcha are valid), I include another file (http_results.php) to process the actual requests. If you wish, you can simply replace the include with all of the following PHP code.


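<?php

$address = $url;

// test_input() escaped these values for display; decode them for the actual request and strip line breaks to be safe
function header_value($value) {
    return str_replace(array("\r", "\n"), "", htmlspecialchars_decode($value));
}

$default_opts = array(
    'http' => array(
        'method' => "GET",
        'user_agent' => header_value($user_agent),
        'protocol_version' => "1.1",
        'header' => "Referer: " . header_value($referer),
    )
);

$default = stream_context_set_default($default_opts);

echo "<h2>Options</h2>";

// $default_opts is nested: wrapper name first, then option => value
foreach ($default_opts as $a => $a_value) {
    echo "<br />";
    echo $a . ":";
    foreach ($a_value as $b => $b_value) {
        echo "<br />";
        echo $b . ": " . htmlspecialchars($b_value);
        echo "<br />";
    }
    echo "<br />";
}

// without the second parameter get_headers() returns a flat list of lines, in order, for every response in the redirect chain
$headers = get_headers($address);
if ($headers === false) {
    $headers = array();
    echo "<br /><span class=\"error\">Could not get any headers from this URL.</span>";
}

// same options as above, but don't follow redirects, so we get the source of each location itself
$source_opts = $default_opts;
$source_opts['http']['follow_location'] = 0;
$context = stream_context_create($source_opts);

$new_url = $address;
$loc = 0;

foreach ($headers as $line) {
    // every response starts with a status line, e.g. "HTTP/1.1 301 Moved Permanently"
    if (strpos($line, "HTTP/") === 0) {
        $loc++;
        $html = ($new_url != "") ? file_get_contents($new_url, false, $context) : "";
        echo "<br /><h2>Location " . $loc . " Source </h2><br />";
        highlight_string((string) $html);
        echo "<br /><strong>Host IP: </strong>" . htmlspecialchars(gethostbyname((string) parse_url($new_url, PHP_URL_HOST))) . "<br />";
        echo "<br /><h2>Headers for Location " . $loc . "</h2><br />";
    }
    // header lines come from the remote server, so escape them before printing
    echo htmlspecialchars($line) . "<br />";
    if (stripos($line, "Location:") === 0) {
        $target = trim(substr($line, strlen("Location:")));
        // only fetch absolute http(s) URLs; anything else (e.g. a relative path) would be read as a local file
        $new_url = preg_match('#^https?://#i', $target) ? $target : "";
    }
}
?>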

Basically that’s it. One more thing to mention: this form (like all my forms) is designed to be posted to the page it sits on, and since I’m using URL rewriting, the form action uses REQUEST_URI, which works fine with it. On a normal contact form or similar, you’d send the user to a separate thank-you page after a successful submission. For the header viewer I didn’t do that, since the user might want to submit multiple requests, but the captcha is there…

Any questions, let me know!

B

 

Update: Balazs’s HTTP Viewer now has basic code highlighting!

Google reCaptcha integration example on PHP

Well – a simple one. But quite decent:

  • Single page implementation: the form is posted to the source page (so that if there is a problem with the entered data, the user can see and correct it there and then)
  • All user input is stripped and sanitized to prevent cross-site scripting attacks
  • Instead of $_SERVER['PHP_SELF'] I use $_SERVER["REQUEST_URI"], as I use URL rewriting
  • And of course, if the form is successfully completed, we send the user to a thank-you page, so if he/she keeps pressing refresh nothing will happen.
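
The server-side half of the captcha check boils down to reading one field of the JSON reply from the siteverify endpoint. A minimal sketch of that decision, kept separate from the HTTP request so it can be tried on its own (the function name captcha_passed() is invented for this example):

```php
<?php
// Hypothetical helper: decides from the raw siteverify reply whether the
// captcha was solved. $json is whatever file_get_contents() returned.
function captcha_passed($json): bool
{
    if (!is_string($json)) {
        return false; // file_get_contents() returns false when the request fails
    }
    $data = json_decode($json, true);
    // Anything other than an explicit boolean true counts as a failure.
    return is_array($data) && ($data['success'] ?? false) === true;
}

// Usage (assumed variable name):
// if (!captcha_passed($result)) { $captchaErr = "Try again!"; }
```

Treating every unexpected reply as a failure means a network error or a malformed response can never let a submission through by accident.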

Now if you see any mistakes, or want to suggest any improvements feel free to do so – otherwise use it :-)

Balazs


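<?php ob_start(); // buffer the output, so the Location header near the end of the file can still be sent ?>
<head>
<script src='https://www.google.com/recaptcha/api.js'></script>
<style>
.error {
    color: red;
}
</style>
</head>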
<?php
// define variables and set to empty values
$nameErr = $emailErr = $emailconfErr = $captchaErr = "";
$name = $email = $message = $emailconf = "";
/* set Site and Secret Key we got from the ReCaptcha API - the secret key only appears in the PHP source, not in the HTML output, so it's OK.
If you just want to use this for testing, take the keys from https://developers.google.com/recaptcha/docs/faq?hl=en - if they don't change them:
Site key: 6LeIxAcTAAAAAJcZVRqyHh71UMIEGNQ_MXjiZKhI
Secret key: 6LeIxAcTAAAAAGG-vFI1TnRWxMZNFuojJ4WifJWe
Just copy them to the variable-definitions below
*/
$secret = "your-secret-key";
$sitekey = "your-public-key";

// Creating a function to strip and sanitize all inputs
function test_input($data) {
    $data = trim($data);
    $data = stripslashes($data);
    $data = htmlspecialchars($data);
    return $data;
}

// Only talk to the reCaptcha server when the form has actually been posted
if ($_SERVER["REQUEST_METHOD"] == "POST") {
    // Now reCaptcha - server side. Has to be on this page, since we post the form to this page.
    $postdata = http_build_query(
        array(
            'secret' => $secret,
            'response' => isset($_POST["g-recaptcha-response"]) ? $_POST["g-recaptcha-response"] : ""
        )
    );

    $opts = array('http' =>
        array(
            'method' => 'POST',
            'header' => 'Content-type: application/x-www-form-urlencoded',
            'content' => $postdata
        )
    );

    $context = stream_context_create($opts);

    /* We pack the g-recaptcha-response and send it with our secret key to the reCaptcha server.
    $result is a JSON string (or false if the request failed), which we need to decode.
    */
    $result = file_get_contents('https://www.google.com/recaptcha/api/siteverify', false, $context);

    // true makes json_decode return an associative array; "success" is only true if the captcha was solved
    $decoded_result = json_decode((string) $result, true);
    $captcha_response = !empty($decoded_result["success"]);

    if (!$captcha_response) {
        $captchaErr = "Try again!";
    }

    if (empty($_POST["name"])) {
        $nameErr = "Please enter your name!";
    } else {
        $name = test_input($_POST["name"]);
        // check if name only contains letters and whitespace
        if (!preg_match("/^[a-zA-Z ]*$/", $name)) {
            $nameErr = "Only letters and white space allowed in name!";
        }
    }

    if (empty($_POST["email"])) {
        $emailErr = "Please enter your email!";
    } else {
        $email = test_input($_POST["email"]);
        // check if e-mail address is well-formed
        if (!filter_var($email, FILTER_VALIDATE_EMAIL)) {
            $emailErr = "Please check your email - invalid format!";
        }
    }

    if (empty($_POST["emailconf"])) {
        $emailconfErr = "Please re-enter your email!";
    } else {
        $emailconf = test_input($_POST["emailconf"]);
        // check if the two e-mail addresses match
        if ($emailconf != $email) {
            $emailconfErr = "E-mail doesn't match - please check!";
        }
    }

    if (empty($_POST["message"])) {
        $message = "";
    } else {
        $message = test_input($_POST["message"]);
    }
}

?>

<form method="post" action="<?php echo htmlspecialchars($_SERVER["REQUEST_URI"]);?>">
Name: <input class="text" type="text" name="name" value="<?php echo $name;?>">
<span class="error"><?php echo $nameErr;?></span>
<br /><br />
E-mail: <input class="text" type="text" name="email" value="<?php echo $email;?>">
<span class="error"><?php echo $emailErr;?></span>
<br /><br />
Confirm E-mail: <input class="text" type="text" name="emailconf" value="<?php echo $emailconf;?>">
<span class="error"><?php echo $emailconfErr;?></span>
<br /><br />
Message: <br /><textarea name="message" rows="5" cols="40"><?php echo $message;?></textarea>
<br />
<div class="g-recaptcha" data-sitekey="<?php echo $sitekey ?>"></div> <span class="error"><?php echo $captchaErr ?></span>
<br />
<input type="submit" name="submit" value="Submit">
</form>

<?php

if ($_SERVER["REQUEST_METHOD"] == "POST" and $nameErr == "" and $emailErr == "" and $emailconfErr == "" and $captchaErr == "") {
    // the message
    $msg = "\r\n\r\nNew message from yourwebsite.com" . "\r\n\r\n" . "From: " . $name . "\r\n\r\n" . "From Email: " . $email . "\r\n\r\n" . "Message:" . "\r\n" . $message;

    // use wordwrap() if lines are longer than 70 characters
    $msg = wordwrap($msg, 70);

    // send email
    $to = "you@yourwebsite.com";
    $subject = "New enquiry - IT Gorleston: " . $name;
    $headers = "From: " . $email . "\r\n" .
        "Reply-To: " . $email . "\r\n" .
        'X-Mailer: PHP/' . phpversion();

    mail($to, $subject, $msg, $headers);

    // The email with all enquiry data has been sent, so we send the user to a thank-you page (you will need to add your own) and stop the script.
    // Note: header() only works if no output has been sent yet, or if output buffering is on.
    header("Location: /thank_you.html");
    exit; // Location header is set, pointless to send HTML, stop the script
} elseif ($_SERVER["REQUEST_METHOD"] == "POST") {
    echo "<br /><span class=\"error\">Your message hasn't been sent! Please correct the mistakes above!</span>";
}
?>