Category: Regex Programming
Grab HTML from Source code pasted into Text Field

I posted this question in the PHP forum, and to avoid spamming the boards I am just going to link this thread to that post.

Link to PHP Board Post (http://forums.devshed.com/php-development-5/grab-html-from-source-code-pasted-into-text-field-583752.html)

Can anyone point me in the right direction on this?

Thanks

Hey, you look familiar. Have we met? Sorry to be a pain but I'm trying to get people to post their regex questions in the right forum.


$before = "The status of the <B>[^<]+</B> of <B>[^<]+</B>\.<P>";
$after = '<TR><TD>Net Income</TD><TD ALIGN=RIGHT>\$[\d,]+</TD></TR>';
preg_match("!$before(.*?)$after!is", $text, $matches);
// $matches[1] is the text in between

$before = "<TABLE>
<TR><TH COLSPAN=2 BGCOLOR=#000040>Land Distribution</TH></TR>";
$after = "<TR><TD>SDI</TD><TD align=right>[\d,]+</TD></TR></TABLE></TD></TR>
</TABLE>";
preg_match("!$before(.*?)$after!is", $text, $matches);
// again, $matches[1] is the text in between
I think that should suffice.

I'm a little confused here. How does this code work? I would like to understand it instead of just asking someone to do it for me.

I don't understand how this will print or echo anything back to the user.

I tried using the code snippet you provided and I get nothing back.

I would really like to understand how this is supposed to work.

Thanks for the reply.

...
I think that should suffice.

Since that input is coming from a text field filled by a user, I highly doubt that it will always be the same. So you won't have a fixed $before and $after string.

Before I can answer your question (with an explanation), could you explain the rules for the strings you want to preserve, or the other way around: explain the rules for the strings that should be removed? You gave just a single source and said you want this and that to be preserved, but what about other forms of input?

Before being able to tell the regex engine what should and should not be removed, you should explain it in great detail here.

Good luck.

Since that input is coming from a text field filled by a user, I highly doubt that it will always be the same. So you won't have a fixed $before and $after string.
Right. Which is why I looked at the two strings and guessed what parts of them would change and what would not.
If I missed something he would likely mention it.


I'm a little confused here. How does this code work? I would like to understand it instead of just asking someone to do it for me.

I don't understand how this will print or echo anything back to the user.

I tried using the code snippet you provided and I get nothing back.

I would really like to understand how this is supposed to work.

Thanks for the reply.
It stuffs that "everything from... to..." into two variables. You're supposed to... I don't know, all you said was that you wanted what was in between them.
I assumed you were going to do something else, like use an HTML parser, or maybe just print out the stuff literally. You didn't really say what you were going to do next and I didn't ask.

Literally, all that code does is search for the $before string (which is generalized a bit) and the $after string (also generalized) and get everything in between them. That's it.
In both cases, $matches[1] contains the text. If you want to do something then you use that. Keep in mind that it contains HTML so if you simply echo/print it out then you'll get (invalid) HTML-formatted text.
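
To make the "nothing comes back" part concrete: nothing is printed until something echoes `$matches[1]`. Here is a minimal sketch, assuming a made-up `$text` sample (not the real advisor page) and the first `$before`/`$after` pair from above. `htmlspecialchars` shows the captured markup as source; a plain `echo` would let the browser render it instead.

```php
<?php
// Hypothetical sample input; the real page source would come from the form field.
$text = '<P>The status of the <B>Republic</B> of <B>Canyon Land (#750)</B>.<P>'
      . '<TABLE><TR><TD>Rank</TD><TD ALIGN=RIGHT>38</TD></TR>'
      . '<TR><TD>Net Income</TD><TD ALIGN=RIGHT>$5,956,807</TD></TR></TABLE>';

// Single quotes keep the regex backslashes exactly as written.
$before = 'The status of the <B>[^<]+</B> of <B>[^<]+</B>\.<P>';
$after  = '<TR><TD>Net Income</TD><TD ALIGN=RIGHT>\$[\d,]+</TD></TR>';

if (preg_match("!$before(.*?)$after!is", $text, $matches)) {
    // $matches[1] holds only what sits between the two markers.
    $inner = $matches[1];
    echo htmlspecialchars($inner), "\n";
} else {
    $inner = null;
    echo "No match: the input does not contain both markers.\n";
}
```

Note that the markers themselves are not part of `$matches[1]`, so the "Net Income" row is left out here. If `preg_match` returns 0, the usual suspects are whitespace or line breaks in the pasted source that the pattern does not allow for.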

If you need an explanation or tutorial on regular expressions then check the sticky here: it has a bunch of links you should look at.


If it's not working then

could you explain the rules for the strings you want to preserve, or the other way around: explain the rules for the strings that should be removed? You gave just a single source and said you want this and that to be preserved, but what about other forms of input?

Before being able to tell the regex engine what should and should not be removed, you should explain it in great detail here.

Right. Which is why I looked at the two strings and guessed what parts of them would change and what would not.
...

Aha, I should have read your post with more attention: I missed that completely! Sorry.

The data I posted in the other thread is a page called advisor in an online game.

This page is the same for all users, except the data in the table and the main Title.

Everything else is the same.

An example of what I am trying to do is at this link:
http://evolution2025.com/qzStatusTidy.php

Copy the code (source code) I posted before, paste it into that page, check the "preview table" box, and click the button.

This will show you what I am trying to learn how to do.

I wish I could explain more, but like I said, I am trying to learn how to do this, but not sure where to start.

Thanks :cool:

I've been messing around with this and got it to return the $before lines of each block of code, but it will not return the other data or the $after lines.

Also, it is not stripping out the JavaScript or other tags; it is just clearing all the white space from the code.

Why not just put it into an array and then you can format it any way you want. This is an old script that still works...



<?php

// $html would be the Status Report you want to process!

// sub string from where we need our data to where our data ends...

// start

$html = substr ( $html, stripos ( $html, 'the status' ) );

// end

$html = substr ( $html, 0, strripos ( $html, '<br><br>' ) );

// setup the html... (get it ready to convert);

$regex = array ( '#<th.*>#Uis', '#<tr.*>#Uis',
'#<td.*>#Uis', '#<\/?table.*>#Uis',
'#<\/th.*><\/tr.*>#Uis', '#&nbsp;#Uis',
'#<\/tr.*>#Uis', '#<\/td.*>#Uis' );

$replace = array ( '<th>', '<tr>',
'<td>', '',
'', '',
'</tr>', '</td>' );

$html = preg_replace ( $regex, $replace, $html );

// split the data up starting at each title element (header)

$parts = explode ( '<tr><th>', $html );

// set our output container

$out = array ();

// set our main header text

$out['header'] = strip_tags ( $parts[0] );

// remove $parts[0] = (our header text) and reset the $parts array!

array_shift ( $parts );

// now build our data array

foreach ( $parts AS $data )
{
    // mark the boundary between the name cell and the value cell (IE: <td>name</td>||<td>value</td>)

    $data = str_replace ( '</td><td>', '</td>||<td>', $data );

    // split into rows at each (<tr><td>)

    $data = explode ( '<tr><td>', $data );

    // the first element $data[0] is always the header for this data block

    $header = trim ( array_shift ( $data ) );

    // go through each <TR> tag set (IE: <tr><td>name</td><td>value</td></tr>)

    foreach ( $data AS $item )
    {
        // get the name/value pair; array_pad avoids an undefined-offset notice when a row has no value cell

        $cells = explode ( '</td>||<td>', substr ( $item, 0, strpos ( $item, '</td></tr>' ) ) );

        list ( $name, $value ) = array_map ( 'trim', array_pad ( $cells, 2, '' ) );

        // there is one case where the structure may cause a false positive, so we catch it here...
        // (compare against '' rather than using empty(), which would also drop a legitimate "0")

        if ( $name !== '' && $value !== '' )
        {
            $out[$header][$name] = $value;
        }
    }
}

// print out the result array...

print_r ( $out );

?>

It will output this...


Array
(
[header] => The status of the Republic of Canyon Land (#750).



[The Basics] => Array
(
[Turns Left] => 10
[Turns Taken] => 2938
[Rank] => 38
[Networth] => $18,143,148
)

[Current Status] => Array
(
[Money] => $238,681,220
[Population] => 501,921
[Land] => 18865 Acres
[Food] => 2,165,120 bushels
[Production] => 7 bushels
[Consumption] => 23,525 bushels
[Net Change] => -23,518 bushels
[Oil] => 1,140,920 barrels
)

[Economics] => Array
(
[Tax Revenues] => $10,375,953
[Tax Rate] => 35%
[Per Capita Income] => $59.06
[Expenses] => $4,419,146
[Military] => $4,112,567
[Alliance/GDI] => $117,929
[Land] => $188,650
[Net Income] => $5,956,807
)

[Land Distribution] => Array
(
[Enterprise Zones] => 8663
[Residences] => 8663
[Industrial Complexes] => 260
[Military Bases] => 960
[Construction Sites] => 300
[Unused Lands] => 19
)

[Military Forces] => Array
(
[Spies] => 218,990
[Troops] => 4,807,197
[Jets] => 9,142,002
[Turrets] => 4,348,609
[Tanks] => 1,328,015
[Nuclear Missiles] => 5
[Chemical Missiles] => 12
[Cruise Missiles] => 9
)

[Technology] => Array
(
[Military] => 288,889
[Medical] => 18,716
[Business] => 508,812
[Residential] => 509,326
[Agricultural] => 2316
[Warfare] => 2931
[Military Strategy] => 9334
[Weapons] => 827
[Industrial] => 8651
[Spy] => 3587
[SDI] => 190,345
)

)
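
As a rough illustration of "format it any way you want", the array can be turned back into a table. This is only a sketch: `render_status_table` is a hypothetical helper name, and the `$out` below is a hand-written subset of the output above, not the result of running the script.

```php
<?php
// Hypothetical subset of the $out array printed above.
$out = array(
    'header'     => 'The status of the Republic of Canyon Land (#750).',
    'The Basics' => array('Turns Left' => '10', 'Rank' => '38'),
);

// Build one table; each section becomes a header row followed by name/value rows.
function render_status_table(array $out)
{
    $html = '<p>' . htmlspecialchars(trim($out['header'])) . "</p>\n<table>\n";
    foreach ($out as $section => $rows) {
        if (!is_array($rows)) {
            continue; // skip the plain 'header' string
        }
        $html .= '<tr><th colspan="2">' . htmlspecialchars($section) . "</th></tr>\n";
        foreach ($rows as $name => $value) {
            $html .= '<tr><td>' . htmlspecialchars($name) . '</td><td align="right">'
                   . htmlspecialchars($value) . "</td></tr>\n";
        }
    }
    return $html . "</table>\n";
}

$table = render_status_table($out);
echo $table;
```

Extra rows or columns can be added inside the loops, which is the main advantage of going through an array first.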

Thanks, but I only want to pull the table out of the code. Putting it into an array is not useful for me; I want the output to look just like the original table, and I am going to add some things in after it strips off the useless data. :chomp:
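
For that narrower goal (keep the table markup, drop scripts and everything else), one possible approach is sketched below. The `$source` string is a made-up stand-in for the pasted page, and the approach assumes the wanted data sits between the first `<table` and the last `</table>`; an HTML parser would be more robust against malformed input.

```php
<?php
// Hypothetical pasted page source; the real input would come from the text field.
$source = '<html><head><script type="text/javascript">var x = "<table>";</script></head>'
        . '<body><a href="#">menu</a>'
        . '<TABLE><TR><TH COLSPAN=2>The Basics</TH></TR>'
        . '<TR><TD>Rank</TD><TD ALIGN=RIGHT>38</TD></TR></TABLE>'
        . '<br><br>footer</body></html>';

// 1. Drop <script> and <style> blocks entirely, including their contents.
$clean = preg_replace('#<(script|style)\b[^>]*>.*?</\1\s*>#is', '', $source);

// 2. Keep everything from the first <table to the last </table>.
$start = stripos($clean, '<table');
$end   = strripos($clean, '</table>');
$table = '';
if ($start !== false && $end !== false && $end > $start) {
    $table = substr($clean, $start, $end - $start + strlen('</table>'));
    // 3. Remove any tags other than table markup; attributes on kept tags survive.
    $table = strip_tags($table, '<table><tr><th><td>');
}

echo $table, "\n";
```

Scripts are removed first because their contents can contain text that looks like table tags, which would throw off the search in step 2.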









