Automatically linking Twitter @usernames in PHP


I keep seeing people write scripts that embed their Twitter feed in their websites. The “easy” way is to use JavaScript, which means you don’t need PHP installed on your server. The drawback is that your tweets will not be visible to visitors with JavaScript disabled.

Really, hardly anybody has JavaScript disabled in their browser anymore; much of the web is unusable without it at this point. However, there are still some very important “visitors” that crawl around the web without JavaScript: search engines. Few if any search engines will actually execute JavaScript on your site when crawling it for content, so anything your page only produces from inside a <script> tag will be hidden from them. If you want your tweets to be indexed as part of your page, you’ll need to use PHP or another server-side scripting language to embed them. This has the added advantage of making your page load faster for regular visitors.

The below diagram should help illustrate the benefit of using PHP to embed your tweets.

Now that you’re convinced that you want to use PHP to embed your Twitter posts, you’re going to quickly run into the problem that people’s Twitter usernames are not given as a link in the RSS feed, but just the @username text. You probably want these usernames linked back to twitter.com.

I have seen some solutions involving splitting up the tweet into individual words, and looking at each to see if it begins with an @ sign. This involves a lot of code, and generally looks something like this:
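A minimal sketch of that word-by-word approach, written here purely for illustration (the function name link_usernames_by_word is hypothetical and is not taken from any particular script):

```php
<?php
// Hypothetical word-by-word linker; the function name is made up for illustration.
function link_usernames_by_word(string $tweet): string
{
    $words = explode(' ', $tweet);
    foreach ($words as $i => $word) {
        // Only consider words that start with @ and have something after it.
        if (strlen($word) > 1 && $word[0] === '@') {
            // Walk forward while the characters are valid in a username.
            $name = '';
            $pos = 1;
            while ($pos < strlen($word) && (ctype_alnum($word[$pos]) || $word[$pos] === '_')) {
                $name .= $word[$pos];
                $pos++;
            }
            if ($name !== '') {
                // Whatever is left is trailing punctuation such as ":" or ".".
                $rest = substr($word, $pos);
                $words[$i] = '<a href="http://twitter.com/' . $name . '">@' . $name . '</a>' . $rest;
            }
        }
    }
    return implode(' ', $words);
}

echo link_usernames_by_word('@Joe: I am eating lunch with @Jane.'), "\n";
```

Even this short version needs a loop inside a loop, and it only splits on single spaces, so a username after a tab or an opening parenthesis would be missed.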

All of that is rendered completely unnecessary by a single regular expression!

$tweet = preg_replace('/(^|[^a-z0-9_])@([a-z0-9_]+)/i', '$1<a href="http://twitter.com/$2">@$2</a>', $tweet);

This regular expression is actually pretty simple. The key part is “(^|[^a-z0-9_])@([a-z0-9_]+)”, which is a lot less scary than it first looks. The ( ) capture what’s inside them so that you can refer to it later (the $1 and $2 in the replacement above). The [ ] match any one character from a set, which can be given as ranges or as a list of characters; here the set is letters, numbers and the underscore (the i after the closing slash makes the match case-insensitive, so a-z covers capitals too). A caret as the first character inside the square brackets, as in [^a-z0-9_], negates the set, so it matches any character that is not in it. The + says “one or more”. The vertical bar | matches either what’s on its left or what’s on its right. The caret ^ outside square brackets matches the beginning of the string (or of every line, if you add the m modifier).

So in English, this regular expression looks for either the start of the string or a character other than a letter, number or underscore, followed by an @ sign, followed by one or more letters, numbers or underscores, which are captured as $2. The match is then replaced with the HTML you see above, where $1 puts back whatever character came before the @ sign and $2 is the username.
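To see the pattern at work, here is a small sketch that wraps the replacement in a helper and runs it on two sample strings. The function name link_usernames and the sample text are assumptions made for this example; the pattern and replacement are the ones from the post.

```php
<?php
// Hypothetical helper wrapping the one-line replacement from the post.
function link_usernames(string $tweet): string
{
    return preg_replace(
        '/(^|[^a-z0-9_])@([a-z0-9_]+)/i',
        '$1<a href="http://twitter.com/$2">@$2</a>',
        $tweet
    );
}

$samples = array(
    '@Joe: I am eating lunch with @Jane.',   // usernames followed by punctuation
    'You can email me at foo@example.com',   // an email address, not a username
);
foreach ($samples as $sample) {
    echo link_usernames($sample), "\n";
}
```

The first sample gets both usernames linked, with the colon and period left outside the links. The second comes back unchanged, because the @ is preceded by a letter and therefore fails the (^|[^a-z0-9_]) part of the pattern.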

Now that you understand the regular expression above, let me further complicate things by showing you how to make text that begins with http:// into a real hyperlink.

$search = array('|(http://[^ ]+)|', '/(^|[^a-z0-9_])@([a-z0-9_]+)/i');
$replace = array('<a href="$1">$1</a>', '$1<a href="http://twitter.com/$2">@$2</a>');
$tweet = preg_replace($search, $replace, $tweet);

Trust me, it really isn’t that bad. The new regular expression is actually simpler than the first; it just looks for http:// instead of @ and then grabs everything up to the next space. You may have also noticed that I switched from using / / to | | around the pattern. PHP lets you use any non-alphanumeric, non-whitespace character other than a backslash as the delimiter. The advantage of | here is that the bar never appears inside this pattern. If I used / as the delimiter with http:// inside, I’d have to escape the forward slashes (it would look like http:\/\/, which is kind of ridiculous).
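As a quick check of both points, the sketch below writes the URL pattern with each delimiter and then runs the two-step replacement on a sample tweet. The variable names and the sample tweet are assumptions for illustration only.

```php
<?php
// Both patterns match the same text; only the delimiter differs.
$with_bars    = '|(http://[^ ]+)|';
$with_slashes = '/(http:\/\/[^ ]+)/';

$tweet = 'Reading http://example.com/page with @aaron';

// URLs are linked first, then usernames, in the same order as the arrays in the post.
$search  = array($with_bars, '/(^|[^a-z0-9_])@([a-z0-9_]+)/i');
$replace = array('<a href="$1">$1</a>', '$1<a href="http://twitter.com/$2">@$2</a>');
$linked  = preg_replace($search, $replace, $tweet);

echo $linked, "\n";
```

The order of the array entries matters because preg_replace applies them one after another. If the username pattern ran first, the http://twitter.com/ address it inserts would then be picked up by the URL pattern and wrapped in a second link.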

You might want to check out http://www.regular-expressions.info to learn more about regular expressions. Regular expressions are an extremely powerful tool you will want to add to your arsenal when learning PHP.


  1. #1 by Matt on March 5th, 2009

    Nice post! I’d like to offer one alternative to your expression if you don’t mind. Sometimes in text you’ll see the following kind of status update:

    @Joe: I am eating lunch with @Jane.

    The regular expression you’re using here would catch @Joe: and @Jane. but the colon and period aren’t part of their usernames. Twitter’s usernames can only be alphanumeric or include the underscore.

    If you change the expression to:
    @([\w_]+)

    then you will only get the username and avoid characters like colons, periods, etc.

    Cheers.

  2. #2 by aaron on March 5th, 2009

    @Matt: Good thinking. I’ve updated the post accordingly! I actually used [0-9a-z_] as the character class because I like being more explicit for readability’s sake.

  3. #3 by Remco Tolsma on March 9th, 2009

    @aaron The second example is not working correctly. You should first replace the links within the tweet and then the @username.

    $search = array('|(http://[^ ]+)|', '|@([0-9a-z_]+)|');
    $replace = array('<a href="$1">$1</a>', '<a href="http://twitter.com/$1">@$1</a>');

  4. #4 by aaron on March 25th, 2009

    @Remco thanks for catching that. I’ve updated the post.

  5. #5 by Cometbus on June 30th, 2009

    Even with the fixes in the comments, consider this:

    @cometbus have you seen @ahem’s last tweet? You can email me at foo@example.com. #contactme@email.

    You will get:
    @cometbus
    @ahem
    @example
    @email

  6. #6 by aaron on June 30th, 2009

    @cometbus

    That actually just occurred to me the other day when I was writing some code to parse some text that has an equal chance of containing Twitter usernames as it does email addresses. Typically tweets don’t contain email addresses so it isn’t usually a problem.

    Here is my updated regex.

    preg_replace('/(^|[^a-z0-9_])@([a-z0-9_]+)/i', '$1<a href="http://twitter.com/$2">@$2</a>', $tweet);

    This requires either a non-alphanumeric character preceding the @ sign, or matches the username at the beginning of the line.

  7. #7 by larry on July 29th, 2009

    Don’t forget to include A-Z in the character class for capitalized letters.

  8. #8 by Ferodynamics on July 30th, 2009

    I doubt Twitter scans their database to see if a username actually exists (before linking to it) because that would slow things down. I saw a tweet today on Twitter’s website: “sold @2.45” meaning “sold at $2.45” and Twitter assumed this was user “2” cut off at the period. Twitter.com/2 actually exists, which is all the more tragic. So if I incorrectly match an email address, I hereby apologize in advance ;-)

  9. #9 by aaron on July 31st, 2009

    @larry: the /i flag makes the regex case-insensitive, so it’s not necessary to include A-Z. /[a-zA-Z]/ and /[a-z]/i are equivalent expressions.

  10. #10 by Medical Revision on September 3rd, 2009

    Hey man – thanks a lot for this. You helped me understand regex a lot better. Hopefully will have it working properly on allaboutchris.co.uk soon!

  11. #11 by thanking you! on November 13th, 2009

    thanks. just what i needed. works perfectly =)

    thank you.
