@caseywdunn
Last active September 25, 2017 18:22
Regular expressions intro and examples

These examples of regular expressions are taken largely from our book Practical Computing for Biologists. More information is available at http://practicalcomputing.org.

This document can be accessed via the GitHub URL shortener at https://git.io/dine

Given the following names:

Agalma elegans
Frillagalma vitiazi
Cordagalma tottoni
Shortia galacifolia
Mus musculus

Challenge: Shorten to A. elegans format

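Can't just search and replace galma with . because that only touches names containing galma, and it leaves the rest of the genus behind (Frillagalma becomes Frilla.).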
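To see why the literal replacement falls short, here is a minimal sketch. It assumes Python rather than the text editor search-and-replace dialog the examples were written for; the variable names are illustrative only.

```python
# Genus/species names from the list above.
names = [
    "Agalma elegans",
    "Frillagalma vitiazi",
    "Cordagalma tottoni",
    "Shortia galacifolia",
    "Mus musculus",
]

# Literal search and replace: swap every "galma" for ".".
literal = [name.replace("galma", ".") for name in names]

for before, after in zip(names, literal):
    print(f"{before} -> {after}")
```

Only the first name ends up in the desired format; two keep a fragment of the genus and two are unchanged.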

###############################################################

Introduce the \w wildcard (matches any single letter, digit, or underscore) to delete compass directions

+40 46'N +014 15'E
+21 17'N -157 52'W

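Try replacing \w with nothing first. This is too broad: it deletes the digits as well as the letters.

Then try '\w replaced by ' so that only the letter directly after the apostrophe is removed.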
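The two attempts can be compared side by side. This is a sketch using Python's `re` module, which is an assumption (the original examples target an editor's search-and-replace); `too_broad` and `cleaned` are illustrative names.

```python
import re

# Coordinates from the example above.
coords = ["+40 46'N +014 15'E", "+21 17'N -157 52'W"]

# First attempt: \w alone matches every letter AND every digit.
too_broad = [re.sub(r"\w", "", line) for line in coords]

# Second attempt: anchor on the apostrophe so only the letter after it is hit,
# then put the apostrophe back in the replacement.
cleaned = [re.sub(r"'\w", "'", line) for line in coords]

print(too_broad)
print(cleaned)
```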

###############################################################

5th
3rd
2nd
4th

Introduce capture ()

Reduce to just numbers:

(\w)\w\w
\1
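As a sketch of the capture step in Python (an assumed environment, not the one the examples were written for), the parentheses keep the first character and `\1` writes it back while the two following characters are dropped:

```python
import re

# Ordinals from the example above.
ordinals = ["5th", "3rd", "2nd", "4th"]

# (\w) captures the digit; the two uncaptured \w consume the suffix letters.
numbers = [re.sub(r"(\w)\w\w", r"\1", item) for item in ordinals]

print(numbers)
```

Note that this pattern assumes single-digit ordinals; a value such as 10th has four word characters and would need a quantifier.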

###############################################################

Revisit original challenge:

Agalma elegans
Frillagalma vitiazi
Cordagalma tottoni
Shortia galacifolia
Mus musculus

Challenge: Shorten to A. elegans format

Introduce + quantifier

(\w)\w+ (\w+)
\1. \2
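The search and replacement terms above can be tried out as follows. This is a sketch in Python's `re` module (an assumption; the original targets an editor's search-and-replace), with illustrative variable names.

```python
import re

names = [
    "Agalma elegans",
    "Frillagalma vitiazi",
    "Cordagalma tottoni",
    "Shortia galacifolia",
    "Mus musculus",
]

# (\w)  captures the first letter of the genus
# \w+   consumes the rest of the genus (one or more word characters)
# (\w+) captures the species name after the space
abbreviated = [re.sub(r"(\w)\w+ (\w+)", r"\1. \2", name) for name in names]

for name in abbreviated:
    print(name)
```

Because `+` matches one or more characters, the same pattern handles genus names of any length.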

###############################################################

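Introduce the . wildcard (matches any single character) and the escape character \ (so that \. or \[ matches a literal . or [)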
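A small sketch of the difference between the wildcard and its escaped form, assuming Python's `re` module and using an accession number from the exercise below as sample text:

```python
import re

accession = "CAA58790.1"

# Unescaped: . matches every character, so everything is replaced.
any_char = re.sub(r".", "_", accession)

# Escaped: \. matches only the literal period.
literal_dot = re.sub(r"\.", "_", accession)

print(any_char)
print(literal_dot)
```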

Exercise 1

>CAA58790.1= green fluorescent protein [Aequorea victoria]
MSKGEELFTGVVPILVELDGDVNGQKFSVRGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFLKSAMPEGYVQERTIFYKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKMEYNYNSHNVYIMGDKPKNGIKVNFKIRHNIKDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSQDPHGKRDHMVLLEFVTSAGITHGMDELYK
>AAZ67342.1= GFP-like red fluorescent protein [Corynactis californica]
MSLSKQVLPRDVKMRYHMDGCVNGHQFIIEGEGTGKPYEGKKILELRVTKGGPLPFAFDILSSVFTYGNRCFCEYPEDMPDYFKQSLPEGHSWERTLMFEDGGCGTASAHISLDKNCFVHKSTFHGVNFPANGPVMQKKTLNWEPSSELITAGDGILKGDVTMFLMLEGGHRLKCQFTTSYKAKKAVKMPPNHIIEHRLVRKEVADAVQIQEHAVAKHFIV
>ACX47247.1= green fluorescent protein [Haeckelia beehleri]
MEFEPEFFNKPVPLEMTLRGCVNGKEFMIFGKGEGDASKGNIKGKWILSHSEDGKCPMSWAVLAPTFAYGFKVFAKYPKDFAHFWQDCMPVGYSERRITRFGRLSGNDDIEQEGIMNTYHEVQMRERMVGDEITWIVESRVKLDATINENSPILMNDGLSEYRPNLERTVSFEDGLKNYSQFFYPIKDCETKDYIIANQMTHERPLSKCNKPGRLPPSHFKRTDLEQWKDSKEDKDHIVQEEITAFLLQAQDKDLQSLGIGM
>ABC68474.1= red fluorescent protein [Discosoma sp. RC-2004]
MRSSKNVIKEFMRFKVRMEGTVNGHEFEIEGEGEGRPYEGHNTVKLKVTKGGPLPFAWDILSPQFQYGSKVYVKHPADIPDYKKLSFPEGFKWERVMNFEDGGVVTVTQDPSLQDGCFIYKVKFIGVNFPSDGPVMQKKTMGWEASTERLYPRDGVLKGEIHKALKLKDGGHYLVEFKTIYMAKKPVQLPGYYYVDSKLDITSHNKDYTIVEQYERTEGRHHLFLKAELGSNVGER
>AAQ01183.1= green fluorescent protein 1 [Pontellina plumata]
MPAMKIECRISGTLNGVVFELVGGGEGIPEQGRMTNKMKSTKGALTFSPYLLSHVMGYGFYHFGTYPSGYENPFLHAANNGGYTNTRIEKYEDGGVLHVSFSYRYEAGRVIGDFKVVGTGFPEDSVIFTDKIIRSNATVEHLHPMGDNVLVGSFARTFSLRDGGYYSFVVDSHMHFKSAIHPSILQNGGSMFAFRRVEELHSNTELGIVEYQHAFKTPTAFA

Challenge: Convert the headers from the format:

>CAA58790.1= GFP [Aequorea victoria]

To:

>CAA58790_Aequorea

(>\w+).+\[(\w+) .+\]
\1_\2
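One way to check a solution against all five headers is the sketch below, which assumes Python's `re` module. The pattern used here ends in `.+\]` rather than requiring a single run of word characters before the closing bracket, so that `[Discosoma sp. RC-2004]`, whose text after the genus includes a period, a space, and a hyphen, is matched too. Variable names are illustrative.

```python
import re

# Header lines copied from the exercise; sequence lines are omitted for brevity.
headers = [
    ">CAA58790.1= green fluorescent protein [Aequorea victoria]",
    ">AAZ67342.1= GFP-like red fluorescent protein [Corynactis californica]",
    ">ACX47247.1= green fluorescent protein [Haeckelia beehleri]",
    ">ABC68474.1= red fluorescent protein [Discosoma sp. RC-2004]",
    ">AAQ01183.1= green fluorescent protein 1 [Pontellina plumata]",
]

# (>\w+)  captures ">" plus the accession up to the first period
# .+\[    skips ahead to the literal opening bracket
# (\w+)   captures the genus
#  .+\]   consumes the rest of the bracketed name and the closing bracket
pattern = re.compile(r"(>\w+).+\[(\w+) .+\]")

short = [pattern.sub(r"\1_\2", header) for header in headers]

for header in short:
    print(header)
```

Sequence lines contain neither `>` nor `[`, so running the same substitution over a whole file should leave them untouched.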