ALL RIGHT, GOOD MORNING EVERYONE. THANKS FOR COMING TODAY.
JUST A FEW HOUSEKEEPING THINGS THIS MORNING. FIRST THE USUAL, DISCLAIMER ABOUT CME FOR
THOSE OF YOU WHO ARE WISHING TO EARN CME CREDITS FOR ATTENDANCE MAKE SURE THAT YOU SIGNED IN
TODAY OR AT THE TELECONFERENCING SITE. THE PAST LECTURES ARE NOW BEING POSTED AT
TWO DIFFERENT PLACES, I JUST WANT TO MAKE YOU AWARE OF THIS.
FIRST ON THE N.I.H. VIDEO CASTING SITE YOU CAN GO TO THIS VERY LONG URL HERE OR EASIER
WOULD JUST BE TO GO TO OUR OWN COURSE WEBSITE AT GENOME.GOV/COURSE 2010.
FOLLOW THE LINKS, THERE'S A DIRECT LINK TO THIS PAGE YOU'LL SEE LAST TWO LECTURES HAVE
ALREADY BEEN POSTED ALL YOU HAVE TO DO IS JUST CLICK ON "PLAY VIDEO" AND THAT WILL POP
OPEN THE VIEWER AND YOU'LL BE ABLE TO WATCH THE LECTURES.
IF YOU HAVE MISSED A LECTURE OR WANT TO CATCH UP OR JUST REVIEW SOME OF THE CONCEPTS PLEASE
FEEL FREE TO TAKE ADVANTAGE OF THAT. ANOTHER SOURCE WHERE YOU CAN GET THE LECTURES
IS GENOME TV, THIS IS A NEW CHANNEL THAT THE GENOME INSTITUTE HAS PUT UP WITH A VARIETY OF
LECTURES SO HERE YOU CAN JUST SEE A SEQUENCING GROUP BUSILY WORKING PREPARING A LOT OF THE
SAMPLES THAT WE TALK ABOUT IN THE COURSE OF THIS LECTURE SERIES.
IN THIS CASE FROM THE CLINICAL SEQUENCING EFFORT.
ERIC'S LECTURE FROM THE VERY FIRST WEEK IS RIGHT HERE UNDER THE FEATURE COLUMN.
WE'RE HAVING A NEW PLAYLIST PUT TOGETHER SO YOU CAN FIND ALL OF THE LECTURES IN ONE
PLACE THERE AS WELL, IN EXACTLY THE SAME WAY AS IF YOU HAD GONE TO YOUTUBE.
SO, WITH THAT, LET'S GO AHEAD AND GET INTO THIS WEEK. WE'LL JUST PICK UP WHERE WE LEFT OFF
LAST WEEK AND TALK ABOUT HOW TO FURTHER ANALYZE SEQUENCES.
PRIMARILY AT THE PROTEIN LEVEL AND AS YOU'LL NOTICE WE'VE CONCENTRATED AT THE PROTEIN LEVEL
THIS MAY SEEM A LITTLE BIT ODD FOR A COURSE CALLED CURRENT TOPICS.
WE DO THIS BECAUSE WE WANT TO REINFORCE THE IMPORTANCE OF THINKING AT BOTH LEVELS, THE DNA LEVEL
AND THE PROTEIN LEVEL, WHEN DOING ANALYSES.
OBVIOUSLY ADVANCES IN GENOME SCIENCE MAKE IT MUCH EASIER TO FIND MUTATIONS,
CHROMOSOMAL ABERRATIONS, LOOK AT CHANGES IN EXPRESSION PATTERN AND SIMILAR EVENTS TAKING
PLACE AT THE NUCLEOTIDE LEVEL. WE NEED TO TRANSLATE THOSE EVENTS THAT ARE
TAKING PLACE AT THE NUCLEOTIDE LEVEL OVER TO THE PROTEIN LEVEL, KEEPING IN MIND THAT
THE PROTEINS ARE THE WORKHORSES OF THE CELL. AND SO BY GOING THROUGH ALL OF THIS MATERIAL
FOR THOSE OF YOU WHO ARE FOCUSED MORE ON THE BASIC RESEARCH SIDE, THIS WILL HOPEFULLY HELP
YOU THINK ABOUT YOUR EXPERIMENTAL DESIGN A LITTLE BIT BETTER.
AND ADVANCE YOUR UNDERSTANDING OF HOW THESE MUTATIONS AT THE NUCLEOTIDE LEVEL AFFECT THINGS
LIKE STRUCTURE AND FUNCTION. FOR THOSE OF YOU THINKING OF CLINICAL QUESTIONS,
YOU MIGHT HAVE DETECTED A MUTATION IN A PATIENT, OR YOU MIGHT HAVE BEEN DOING SOME GENETIC
SUSCEPTIBILITY TEST OR SOME OTHER TARGETED TESTING. YOU NEED TO UNDERSTAND THE NET EFFECT OF THOSE
MUTATIONS IN YOUR PATIENTS TO BETTER UNDERSTAND WHY AM I SEEING THE PHENOTYPES THAT I'M SEEING,
TO START TO GET INSIGHT INTO METABOLIC CHANGES THAT MIGHT BE TAKING PLACE IN THOSE PATIENTS,
AND TO HELP DETERMINE WHICH MUTATIONS ARE POTENTIALLY PATHOGENIC.
SO, THIS WILL IN TURN HELP YOU HOPEFULLY START TO THINK ABOUT THERAPEUTIC APPROACHES.
YES, WE ARE FOCUSING PRIMARILY ON THE PROTEIN SIDE OF THE HOUSE BUT I JUST WANT TO DRIVE
HOME WHY THAT'S IMPORTANT. WE'RE GOING TO TALK TODAY ABOUT THINGS LIKE
PROFILES, PATTERNS, MOTIFS AND DOMAINS, AND ABOUT PROTEIN SECONDARY STRUCTURE.
WE'RE GOING TO MOVE TO THE THREE-DIMENSIONAL LEVEL, SO GOING AWAY FROM JUST LOOKING AT STRINGS
OF LETTERS AND NOW LOOKING AT THREE-DIMENSIONAL STRUCTURES. I WILL TELL YOU ABOUT AN ANALYSIS
TOOL THAT'S VERY EASY TO USE AND A THREE-DIMENSIONAL VIEWER.
AND WE'LL FINALLY END UP TALKING ABOUT MULTIPLE SEQUENCE ALIGNMENTS. THIS THEME WILL COME UP
OVER AND OVER AGAIN IN THE REST OF THIS COURSE WHEN WE START TO LOOK AT THE GENOME BROWSERS
AND OTHER TECHNIQUES THAT ARE AVAILABLE TO US TO DO ADVANCED GENOME ANALYSIS.
WE'LL START WITH SEQUENCE COMPARISONS ONCE AGAIN.
AND THE APPROACH THAT WE USED LAST TIME AROUND WAS HOMOLOGY SEARCHES.
WE DID THESE ONE-AGAINST-ONE SEARCHES, TAKING A SEQUENCE OF INTEREST AND COMPARING THAT
AGAINST A SET OF SEQUENCES, PROBABLY IN A PUBLIC DATABASE, TO FIND OTHER SEQUENCES THAT ARE
SIMILAR TO THE ONE THAT WE STARTED WITH. AND THE METHOD THAT WE SPENT MOST OF OUR TIME
TALKING ABOUT LAST TIME AROUND WAS BLAST AND ALSO FASTA.
ONE AGAINST ONE COMPARISONS AGAINST A LARGE COLLECTION OF SEQUENCES.
WE CAN ALSO TAKE A SLIGHTLY DIFFERENT APPROACH AND LOOK AT THE COLLECTIVE CHARACTERISTICS
OF PROTEIN FAMILIES TO FIND SIMILARITIES BETWEEN PROTEIN SEQUENCES.
NOW, THESE SEARCHES CAN BE ONE AGAINST MANY. I'LL TELL YOU ABOUT SEVERAL APPROACHES TO
DO THAT. OR MANY AGAINST ONE, WHERE YOU'VE GOT A COLLECTION
OF SEQUENCES WHERE YOU'RE TRYING TO FIND A NEW SEQUENCE OF INTEREST I'LL TELL YOU ABOUT
A BLAST RELATED METHOD THAT CAN DO THAT. WE'LL START WITH THE ONE AGAINST MANY.
AND TALK ABOUT PROFILES. A COUPLE OF DEFINITIONS AGAIN TO START OFF TODAY'S
LECTURE. WHENEVER I TALK ABOUT A PROFILE,
A PROFILE QUITE SIMPLY IS A NUMERICAL REPRESENTATION OF A MULTIPLE SEQUENCE ALIGNMENT.
I CAN TAKE ANY MULTIPLE SEQUENCE ALIGNMENT AND REPRESENT THAT AS A MATRIX, THE SAME WAY
THAT WE TALKED ABOUT OUR SCORING MATRICES LAST WEEK.
IF YOU RECALL, WE TALKED ABOUT THE BLOSUM SERIES OF MATRICES THAT WERE DERIVED
FROM MULTIPLE SEQUENCE ALIGNMENTS, AND USING THOSE WE CAME UP WITH THE MATRICES. THEY CONVEYED
TO US THINGS LIKE WHEN SUBSTITUTIONS CAN TAKE PLACE AND WHERE IMPORTANT RESIDUES HAD
TO BE CONSERVED. THESE PROFILES, LIKE LAST WEEK, DEPEND ON
PATTERNS OR MOTIFS THAT CONTAIN CONSERVED RESIDUES, AND THEY REPRESENT THE COMMON CHARACTERISTICS
OF A PROTEIN FAMILY. SO, BECAUSE OF THAT AND THE POWER OF USING
A COLLECTION OF SEQUENCES AS YOUR BASIS FOR COMPARISON WE CAN START TO FIND SIMILARITIES
BETWEEN SEQUENCES THAT HAVE LITTLE TO NO SEQUENCE IDENTITY.
SO, I OFFER YOU AN EXAMPLE TO MAKE THIS POINT, THE HOMEODOMAIN FAMILY. IF YOU LOOK
AT THE HOMEODOMAIN, ONLY A HANDFUL OF RESIDUES, FEWER THAN TEN, ARE CONSERVED AMONGST
ALL OF THE SEQUENCES IN THE CLASS. AND IF YOU WERE TO USE BLAST, WHICH WE TALKED
ABOUT QUITE A BIT LAST WEEK, YOU WOULDN'T ACTUALLY FIND ALL OF THE HOMEODOMAIN SEQUENCES
THAT ARE IN THE PUBLIC DATABASES. WE NEED TO ADD SOME ADDITIONAL TOOLS TO OUR
ARSENAL TO HELP US FIND DISTANT SIMILARITIES AND MAKE ADDITIONAL BIOLOGICAL CONCLUSIONS
WHERE BLAST WON'T QUITE GET US ALL THE WAY. MORE THINGS TO PUT INTO YOUR TOOLKIT.
[ NOT AUDIBLE ] YES, THERE IS.
I MENTIONED LAST WEEK THAT THE DEFINITION OF SIMILARITY IS WHEN YOU HAVE YOUR IDENTICAL RESIDUES PLUS
CONSERVATIVE SUBSTITUTIONS. SO YOU HAVE A PERCENT IDENTITY
AND PERCENT SIMILARITY. AGAIN, PERCENT SIMILARITY INCLUDES CONSERVATIVE
SUBSTITUTIONS. HOW DO WE ACTUALLY CONSTRUCT ONE OF THESE
PROFILES? SO, AGAIN, THE PROFILES ARE BASED ON MULTIPLE
SEQUENCE ALIGNMENTS. HERE I HAVE A MULTIPLE SEQUENCE ALIGNMENT,
THERE'S A NUMBER OF SEQUENCES HERE, TEN ACROSS. YOU'LL NOTICE THAT I'VE HIGHLIGHTED SOME OF
THESE POSITIONS IN RED. SO THE LAST POSITION YOU'LL SEE THAT THERE
IS ALWAYS A GLYCINE RESIDUE IN THE TENTH POSITION. I ALSO HAVE ONE IN THE 8TH POSITION.
IF YOU LOOK AT THE 9TH POSITION MOST OF THE TIME YOU HAVE PROLINE BUT SOMETIMES YOU HAVE
A THREONINE THROWN IN AS WELL. WE'LL TAKE A LOOK AT THIS.
WE'LL ASK FOUR QUESTIONS. SEE, OKAY, WHAT RESIDUES ARE SEEN AT EACH
ONE OF THE POSITIONS. WE GET AN IDEA OF THE FREQUENCY OF AMINO ACIDS
AT EACH POSITION. WHAT'S ALLOWED.
WHAT IS THE FREQUENCY OF THOSE OBSERVED RESIDUES. WHICH POSITIONS ARE CONSERVED, EITHER
OUTRIGHT, ABSOLUTELY CONSERVED, OR JUST CONSERVATIVELY, WHERE WE SEE CONSERVATIVE SUBSTITUTIONS.
FINALLY WHERE WE CAN INTRODUCE GAPS. WE DON'T HAVE ANY GAPS IN THIS PARTICULAR
EXAMPLE BUT GAPS CAN CERTAINLY EXIST IN THESE ALIGNMENTS.
BASED ON THOSE FOUR QUESTIONS WE CAN CONSTRUCT SOMETHING CALLED THE POSITION SPECIFIC SCORING
TABLE. ALL OF THESE NUMBERS IN THIS
TABLE REPRESENT THE ANSWERS TO THOSE FOUR QUESTIONS.
LET ME TAKE YOU THROUGH THIS, YOU'LL SEE ACROSS THE TOP YOU HAVE EACH ONE OF THE AMINO ACIDS
GOING DOWN THE SIDE IS THE CONSENSUS AT EACH POSITION IN OUR MULTIPLE SEQUENCE ALIGNMENT.
HERE WE HAVE POSITION ONE HERE, POSITION ONE IS AT THE TOP OF THE TABLE, POSITION TEN IS
AT THE BOTTOM. WE BASICALLY JUST TURNED THIS ALIGNMENT ON
TO ITS SIDE. YOU'LL REMEMBER THAT THERE'S ALWAYS A G IN
THAT FINAL POSITION, IN POSITION TEN. IF WE LOOK AT THE G HERE, AND IF WE LOOK AT
THE G IN THE AMINO ACIDS GOING ACROSS THE TOP.
LOOK WHERE THOSE TWO INTERSECT, WE SEE VALUE OF 150.
AND 150 IS THE LARGEST NUMBER IN THE TABLE. SO, ANY TIME YOU HAVE A POSITION WHERE THAT
RESIDUE ABSOLUTELY, POSITIVELY, MUST APPEAR YOU ARE GOING TO ASSESS A VERY, VERY HIGH
POSITIVE SCORE. DRAWING AN ANALOGY BACK TO THE BLOSUM TABLE
LAST WEEK, ANY TIME WE SAW A CONSERVED RESIDUE WE ALWAYS GAVE THOSE CONSERVED MATCHES THE
HIGHEST POSSIBLE SCORE. LET'S CONSIDER NOW THE NEXT TO LAST POSITION.
AGAIN, MOST OF THE TIME WE HAVE A PROLINE BUT SOMETIMES WE HAVE A THREONINE.
WE'LL TAKE A LOOK OVER HERE AT THE P AND SEE WHERE IT INTERSECTS WITH THE P ACROSS THE TOP.
THIS TIME IT SAYS 89. NOT AS HIGH AS THE 150, TO REFLECT THE FACT
THAT THIS IS A CONSERVATIVE SUBSTITUTION. THESE RESIDUES CAN SUBSTITUTE FOR ONE ANOTHER.
IT WOULD BE BETTER TO HAVE THE EXACT MATCH BUT WE WANT TO AT LEAST ALLOW FOR THAT WIGGLE ROOM
AS WELL. FINALLY LET'S JUST TAKE A LOOK AT THE SECOND
POSITION WHERE YOU'VE GOT JUST ABOUT EVERYTHING GOING ON, CONSENSUS HERE IS A PROLINE IF WE
LOOK ACROSS TO WHERE THE PROLINE IS WE HAVE A MUCH LOWER POSITIVE SCORE.
YOUR BEST SCORES ARISE WHEN YOU HAVE AN EXACT MATCH OF THAT RESIDUE AT THAT POSITION THAT
IS ABSOLUTELY CONSERVED, THEN THE SCORES START TO GO DOWN AS POSITIONS BECOME LESS AND LESS
CONSERVED. FORTUNATELY YOU DON'T HAVE TO GENERATE THESE,
THESE ARE ALL GENERATED FOR YOU. AND YOU CAN NOW USE THESE AND COMPARE
YOUR SINGLE SEQUENCE OF INTEREST AGAINST HUNDREDS OF THESE.
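TO MAKE THE IDEA OF A POSITION SPECIFIC SCORING TABLE CONCRETE, HERE IS A MINIMAL PYTHON SKETCH. IT IS NOT HOW PFAM OR ANY OTHER DATABASE ACTUALLY BUILDS ITS PROFILES. THE FOUR-COLUMN TOY ALIGNMENT, THE UNIFORM BACKGROUND FREQUENCY, THE PSEUDOCOUNT AND THE FUNCTION NAMES ARE ALL ASSUMPTIONS MADE FOR ILLUSTRATION, AND GAPS ARE NOT HANDLED. THE POINT IS ONLY THAT AN ABSOLUTELY CONSERVED COLUMN GETS THE HIGHEST SCORE AND LESS CONSERVED COLUMNS SCORE LOWER.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    Assumptions (illustrative only): ungapped, equal-length sequences,
    a uniform background frequency, and a simple additive pseudocount.
    """
    background = 1.0 / len(AMINO_ACIDS)
    n_seqs = len(alignment)
    pssm = []
    for column in zip(*alignment):           # walk the alignment column by column
        counts = Counter(column)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / (n_seqs + pseudocount * len(AMINO_ACIDS))
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def score_sequence(pssm, sequence):
    """Sum the per-position scores of an ungapped sequence against the table."""
    return sum(column[aa] for column, aa in zip(pssm, sequence))

# Hypothetical toy alignment: last column always G, next-to-last mostly P with one T.
TOY_ALIGNMENT = ["ACPG", "AKPG", "SCTG", "ADPG"]
TOY_PSSM = build_pssm(TOY_ALIGNMENT)
```

IN THIS TOY TABLE THE G IN THE LAST COLUMN SCORES HIGHER THAN THE P IN THE NEXT-TO-LAST COLUMN, WHICH IN TURN SCORES HIGHER THAN THE T, WHICH IS THE SAME ORDERING AS THE 150 AND THE 89 ON THE SLIDE.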
IN ESSENCE THIS TAKES THE PLACE OF A SEQUENCE. THE SAME WAY WE COMPARED SEQUENCE A TO SEQUENCE B,
NOW WE'RE GOING TO SAY, I HAVE SEQUENCE A, AND SEQUENCE B IS MY COLLECTION OF THESE POSITION
SPECIFIC SCORING TABLES. THE SECOND DEFINITION IS A PATTERN.
IN THE CASE OF PROFILES WE HAVE TWO TYPES OF INFORMATION.
WE HAVE POSITIONAL INFORMATION, WHAT RESIDUES CAN APPEAR AT A GIVEN POSITION.
BUT WE ALSO HAVE FREQUENCY INFORMATION, WHICH IS CAPTURED IN THE TABLE BY ALL OF THOSE
NUMBERS. HERE, WE JUST HAVE A SHORTHAND TO REPRESENT
TO US WHAT RESIDUES CAN EXIST AT A GIVEN POSITION. WE HAVE NO FREQUENCY INFORMATION.
HERE JUST WHAT IS ALLOWED WHERE. IT'S NOT EXACTLY INTUITIVE HOW TO INTERPRET THIS, SO
LET ME TAKE YOU THROUGH IT. AT THE VERY BEGINNING WE HAVE THESE RESIDUES IN SQUARE
BRACKETS. WHAT THEY CONVEY TO YOU IS ONE OF.
SO IN POSITION ONE OF THAT MOTIF YOU HAVE TO HAVE EITHER ONE OR THE OTHER.
IN THE NEXT POSITION WE SEE AN X. THE X MEANS ANYTHING.
YOU CAN HAVE ANY AMINO ACID AT THAT POSITION. WE THEN SEE A SINGLE AMINO ACID, A CYSTEINE,
WITH NO EMBELLISHMENTS AROUND IT. THAT MEANS THAT YOU ABSOLUTELY HAVE TO HAVE A CYSTEINE
AT THAT POSITION. AGAIN, HERE IS THE X BUT NOW THE X IS FOLLOWED
BY A TWO. THAT JUST MEANS TWO OF THEM.
SO ANY TWO AMINO ACIDS IN A ROW. NOW, AGAIN, TWO AMINO ACIDS ARE HERE, BUT INSTEAD
OF THE SQUARE BRACKETS WE HAVE CURLY BRACKETS, WHICH MEAN THE OPPOSITE OF THE SQUARE.
IT MEANS ANYTHING OTHER THAN THESE. AGAIN, THE X, SO ANY AMINO ACID, AND FINALLY
HERE THE SAME NOTATION WE'VE SEEN OVER HERE: HERE WE'VE GOT A HISTIDINE AND THE NUMBER THREE,
SO JUST THREE IN A ROW. THIS DEFINES A PATTERN THAT MATCHES EVERY
SINGLE MEMBER OF THAT PARTICULAR CLASS OF PROTEINS.
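THE NOTATION JUST DESCRIBED MAPS ALMOST DIRECTLY ONTO A REGULAR EXPRESSION, SO HERE IS A SMALL PYTHON SKETCH OF THAT TRANSLATION. THE SLIDE ITSELF IS NOT IN THIS TRANSCRIPT, SO THE EXAMPLE PATTERN BELOW IS AN ASSUMED ONE THAT SIMPLY FOLLOWS THE ELEMENTS IN THE ORDER THEY WERE READ OUT, AND THE FUNCTION NAME IS MADE UP FOR ILLUSTRATION. ONLY THE ELEMENTS COVERED ABOVE ARE SUPPORTED: SQUARE BRACKETS, CURLY BRACKETS, X, SINGLE RESIDUES, AND REPEAT COUNTS IN PARENTHESES.

```python
import re

# One pattern element: [..], {..}, a single residue, or x, with an optional (n) or (n,m).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a dash-separated pattern into a Python regular expression."""
    pieces = []
    for element in pattern.split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":                      # x: any amino acid
            piece = "."
        elif core.startswith("{"):           # {..}: anything other than these
            piece = "[^" + core[1:-1] + "]"
        else:                                # [..] (one of) or a single required residue
            piece = core
        if low is not None:                  # (n) or (n,m): repeat count
            piece += "{" + low + ("," + high if high else "") + "}"
        pieces.append(piece)
    return "".join(pieces)

# Assumed example pattern, not taken from the slide.
EXAMPLE_PATTERN = "[FY]-x-C-x(2)-{VA}-x-H(3)"
EXAMPLE_REGEX = prosite_to_regex(EXAMPLE_PATTERN)
```

ONCE YOU HAVE THE REGULAR EXPRESSION, `re.search(EXAMPLE_REGEX, sequence)` TELLS YOU WHETHER A SEQUENCE CONTAINS THE PATTERN. NOTE THAT THERE IS NO FREQUENCY INFORMATION ANYWHERE IN IT, ONLY WHAT IS ALLOWED WHERE.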
A LITTLE BIT DIFFERENT APPROACH. BUT BOTH OF THEM ARE QUITE USEFUL.
LET'S GO AHEAD PUT THESE BOTH IN TO PRACTICE AND TALK ABOUT OUR FIRST DATABASE OF THE MORNING.
THIS IS SOMETHING CALLED PFAM, WHICH STANDS FOR PROTEIN FAMILIES, AND IT IS JUST A COLLECTION OF THESE PROFILES,
THESE MULTIPLE ALIGNMENTS OF PROTEIN DOMAINS AND CONSERVED REGIONS.
HOPEFULLY THESE REPRESENT REGIONS WITH SOME SORT OF STRUCTURAL SIGNIFICANCE OR SOME SORT
OF FUNCTIONAL IMPORTANCE. WHEN WE LOOK AT THESE ENTRIES, WE'LL SEE AN
EXAMPLE IN A COUPLE OF MINUTES, WE'LL SEE THE MULTIPLE SEQUENCE ALIGNMENTS
OF FAMILY MEMBERS, WE'LL START TO GET A SENSE OF DOMAIN ARCHITECTURE.
SO WHEN I USE THAT TERM, ALL THAT MEANS IS WHAT IS THE SERIES OF DOMAINS THAT I SEE IN
A PARTICULAR PROTEIN. THE NATURE OF THE DOMAINS AND THE ORDER OF
THE DOMAINS. I'LL GET AN IDEA OF WHICH BIOLOGICAL SPECIES
HAVE PROTEINS THAT ARE MEMBERS OF THAT PARTICULAR CLASS.
SOME INFORMATION ON KNOWN PROTEIN STRUCTURES AND LINKS TO SOME OTHER DATABASES AS WELL.
THERE ARE TWO TYPES OF PFAM ENTRIES. ONE IS PFAM A.
THESE ARE THE BETTER OF THE TWO OPTIONS. IN PRIOR VERSIONS THIS HAD BEEN PFAM. THESE
ARE BASED ON CURATED MULTIPLE ALIGNMENTS. A METHOD IS USED TO FIND ALL OF THE DETECTABLE
PROTEIN SEQUENCES THAT MATCH THAT ALIGNMENT. SOMEONE HAS GONE AHEAD AND MADE THIS MULTIPLE
SEQUENCE ALIGNMENT FOR YOU. GENERATED THE SCORING MATRIX AND THEN SEARCHED
THE DATABASES TO FIND ALL OF THE OTHER MEMBERS OF THE FAMILY.
SO, BECAUSE OF THE WAY IT'S DONE, BECAUSE OF THIS HANDCRAFTED METHOD THE HITS THAT YOU
FIND TO PFAM A ARE MORE THAN LIKELY TRUE POSITIVES. PFAM B IS STRICTLY AUTOMATED.
THAT IS DEEMED TO BE LOWER QUALITY. BUT YOU SHOULD CERTAINLY LOOK AT THOSE.
BECAUSE YOU MIGHT HAVE A SITUATION WHERE YOU HAVE DONE A QUERY, YOU DON'T FIND A MATCH
TO PFAM A BUT YOU MIGHT SEE SOMETHING TO PFAM B.
USING A HANDCRAFTED BEER ANALOGY, PFAM A IS LIKE THE STELLA AND PFAM B IS MORE LIKE THE OTHERS.
TO REMIND YOU, WE'RE GOING TO GO THROUGH ALL OF THESE EXAMPLES AS IF WE ARE SITTING AT
THE COMPUTER, AS WITH LAST WEEK IF YOU WOULD LIKE TO GO BACK TO YOUR LABS OR BACK TO YOUR
OFFICES AND REPEAT WHAT WE'RE DOING IN CLASS THIS MORNING YOU CAN GO TO THIS PAGE, ALL
OF THE SEQUENCES ARE ON THIS PAGE. YOU CAN JUST CUT THEM, PASTE THEM IN TO THE
BOXES AND FOLLOW THE STEPS THAT WE'RE GOING THROUGH.
LET'S GO TO THE PFAM SITE. AS ALWAYS I'LL GIVE YOU THE URL AT THE TOP
OF THE PAGE. WE'VE GONE OFF TO THE SANGER CENTER IN THE
U.K. HERE IS A VERY SIMPLE FRONT END. WHAT WE'RE GOING
TO DO IS A SEQUENCE SEARCH, BUT LET'S TAKE A LOOK AT WHAT ELSE WE HAVE.
YOU CAN -- I LOVE THE NAME OF THIS, YOU CAN SEE SOME GROUPS OF RELATED FAMILIES, WITH DEFERENCE
TO THE SCOTTISH PEOPLE THAT WORK ON THE DATABASE. YOU CAN JUMP DIRECTLY TO A PARTICULAR PROTEIN
FAMILY IF YOU KNOW THE NAME OF THAT PROTEIN FAMILY, OR JUST DO A SIMPLE KEYWORD SEARCH.
AGAIN WE'RE JUST GOING TO GO AHEAD DO OUR SEQUENCE SEARCH.
IF WE CLICK ON THE WORDS WE GET A BOX. WE CAN QUITE SIMPLY PUT OUR SEQUENCE IN THAT
BOX. BUT AS YOU HAVE COME TO LEARN, I LIKE TO LOOK
AT THE OPTIONS THAT ARE AVAILABLE TO ME. SO TO SEE WHAT PARAMETERS WE CAN CHANGE HERE
THERE IS A VERY UNOBTRUSIVE LINK HERE THAT JUST SAYS, PERFORM OTHER SEARCHES HERE.
YOU CLICK ON THE WORD "HERE" THAT WILL NOW EXPAND THIS OUT SO WE CAN SEE WHAT IS AVAILABLE
TO US. HERE IS OUR SEQUENCE BOX.
I HAVE PASTED IN THE APPROPRIATE SEQUENCE FROM OUR LIST OF SEQUENCES.
BUT WE HAVE JUST A FEW OPTIONS HERE THAT ARE WORTH TALKING ABOUT.
FIRST ONE IS CALLED CUT OFF. IT IS AUTOMATICALLY SET TO USE E-VALUE OF
1.0. THAT IS THE DEFAULT VALUE.
BUT KEEP IN MIND THE GUIDELINES THAT I GAVE YOU LAST WEEK FOR PROTEIN SEQUENCE COMPARISONS,
BECAUSE THAT'S BASICALLY WHAT WE'RE DOING HERE EVEN THOUGH WE'RE USING THESE MATRICES
IT'S STILL A PROTEIN-PROTEIN COMPARISON. SO, IN THE BACK OF YOUR HEAD, KEEP THAT TEN
TO THE MINUS THIRD GUIDELINE IN MIND AS YOU LOOK AT THE RESULTS.
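AS A TINY ILLUSTRATION OF KEEPING THAT GUIDELINE IN MIND, HERE IS A PYTHON SKETCH THAT FILTERS A LIST OF HITS BY E-VALUE. THE HIT NAMES AND E-VALUES ARE INVENTED PLACEHOLDERS, NOT RESULTS FROM ANY REAL SEARCH, AND THE `Hit` TYPE AND FUNCTION NAME ARE ASSUMPTIONS MADE FOR THIS EXAMPLE.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    family: str
    evalue: float

def significant_hits(hits, max_evalue=1e-3):
    """Keep only hits at or below the E-value guideline (ten to the minus third by default)."""
    return [hit for hit in hits if hit.evalue <= max_evalue]

# Hypothetical hits: the search itself may report anything up to its own cutoff of 1.0.
EXAMPLE_HITS = [
    Hit("family_A", 1e-40),
    Hit("family_B", 0.02),
    Hit("family_C", 0.7),
]
KEPT = significant_hits(EXAMPLE_HITS)
```

WITH THESE MADE-UP NUMBERS ALL THREE HITS PASS THE DEFAULT CUTOFF OF 1.0, BUT ONLY THE FIRST ONE SURVIVES THE STRICTER GUIDELINE.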
WHAT GATHERING THRESHOLDS HERE MEANS IS THAT IT WILL AUTOMATICALLY, IF YOU WANT TO ADJUST
THAT E-VALUE TO USE THE SAME E-VALUE THAT WAS USED TO CONSTRUCT EACH INDIVIDUAL SCORING
MATRIX SO IF YOU HAVE REASON TO DO THAT, THAT'S FINE BUT I USUALLY JUST LEAVE THAT AS IS.
REALLY THE MOST IMPORTANT THING HERE IS TO JUST CLICK ON THIS SEARCH FOR PFAM B BOX, WHICH
IS UNCLICKED BY DEFAULT. WE DO WANT TO SEE THOSE RESULTS EVEN THOUGH
THEY ARE LOWER QUALITY THERE MIGHT BE SOMETHING INTERESTING THERE.
WE'LL CLICK ON "SUBMIT" SEE WHAT WE GET. WHAT WE GET IS SOMETHING THAT LOOKS LIKE THIS.
AT THE VERY TOP OF THE PAGE WE HAVE AN OVERVIEW OF OUR RESULTS.
IT JUST SAYS WE FOUND THREE PFAM A MATCHES TO YOUR SEARCH SEQUENCE, ONE SIGNIFICANT, TWO
INSIGNIFICANT, BUT WE DIDN'T FIND ANY PFAM B MATCHES.
WE'LL ONLY LOOK AT THE ONE HIT. THEN WE HAVE A REPRESENTATION OF WHAT WAS FOUND. IT'S
MARKED P450. THIS HAD A CYTOCHROME P450 DOMAIN.
YOU'LL NOTICE THAT THE LEFT SIDE IS ROUNDED AND THE RIGHT SIDE IS JAGGED.
THE LEFT SIDE BEING ROUNDED JUST MEANS THAT OUR SEQUENCE LINES UP WITH THE BEGINNING OF
THE MOTIF FOR THAT PARTICULAR PROTEIN DOMAIN. BUT THE JAGGED PART MEANS WE DIDN'T GET ALL
THE WAY TO THE OTHER END. THIS IS PARTIAL OVERLAP WITH A PARTICULAR
DOMAIN, IN THIS CASE P450. IF WE NOW LOOK AT THE TABULAR RESULTS, CYTOCHROME
P450, OUR SEQUENCE ALIGNS WITH THE DOMAIN STARTING AT POSITION 41.
ENDING AT POSITION 500. TO SEE A LITTLE BIT MORE DETAIL YOU'LL SEE
IN THE RED BOX HERE THAT SAYS, SHOW, OR HIDE ALL ALIGNMENT.
IF YOU CLICK ON THE WORD "SHOW" JUST BASICALLY EXPANDS THE PAGE.
INSTEAD NOW WHAT WE HAVE HERE IS THE ALIGNMENT. THERE IS A NUMBER OF LINES IN THIS ALIGNMENT,
THIS MIGHT BE EASIER TO SEE ON YOUR HANDOUT. YOUR SEQUENCE, THAT WE STARTED WITH FOR THE
QUERY IS THE ONE THAT IS JUST LABELED SEQ FOR SEQUENCE.
THE HMM IN THE FIRST ROW IS THE ACTUAL CONSENSUS FROM THE PROFILE THAT IT MATCHED.
SO THAT IS THE ACTUAL P450 PROFILE REPRESENTED AS A CONSENSUS SEQUENCE.
THE NEXT LINE DOWN, WHERE IT SAYS MATCH, QUITE SIMPLY SHOWS WHICH POSITIONS MATCH. YOU HAVE A QUALITATIVE,
JUST A VISUAL OVERVIEW OF HOW GOOD THOSE MATCHES WERE.
SAME RULES AS LAST WEEK. ANY PLACE YOU SEE AN EXACT MATCH YOU SEE THE
LETTER REPEATED ON THAT LINE. ANY PLACE YOU SEE A CONSERVATIVE SUBSTITUTION
YOU SEE A PLUS SIGN. THEN THAT FINAL LINE, "PP", IS THE POSTERIOR PROBABILITY,
WHICH IS JUST A QUANTITATIVE MEASURE OF HOW GOOD THE MATCH IS, POSITION BY POSITION, TO THIS
PARTICULAR POSITION SPECIFIC SCORING MATRIX, AND THE RULES FOR THAT ARE IN THE DOCUMENTATION
RIGHT ABOVE THAT. WE SEARCH FOR PFAM B, WE DIDN'T FIND ANYTHING.
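TO SHOW HOW A MATCH LINE LIKE THE ONE JUST DESCRIBED CAN BE PRODUCED, AND HOW IT RELATES TO PERCENT IDENTITY AND PERCENT SIMILARITY, HERE IS A ROUGH PYTHON SKETCH. REAL TOOLS DECIDE WHAT COUNTS AS A CONSERVATIVE SUBSTITUTION FROM A SCORING MATRIX OR PROFILE. THE HAND-PICKED RESIDUE GROUPS BELOW ARE AN ASSUMPTION MADE ONLY TO KEEP THE EXAMPLE SELF-CONTAINED, AS ARE THE FUNCTION NAMES AND THE TWO SHORT SEQUENCES.

```python
# Assumed, simplified groups of residues treated as conservative substitutions.
CONSERVATIVE_GROUPS = [set("ILVM"), set("FYW"), set("KRH"), set("DE"), set("ST"), set("NQ")]

def match_line(seq_a, seq_b):
    """Letter for an exact match, '+' for a conservative substitution, blank otherwise."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    line = []
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":             # gap in either sequence
            line.append(" ")
        elif a == b:
            line.append(a)
        elif any(a in group and b in group for group in CONSERVATIVE_GROUPS):
            line.append("+")
        else:
            line.append(" ")
    return "".join(line)

def percent_identity_and_similarity(seq_a, seq_b):
    """Similarity counts identical positions plus conservative substitutions."""
    line = match_line(seq_a, seq_b)
    identical = sum(1 for ch in line if ch.isalpha())
    similar = identical + line.count("+")
    return 100 * identical / len(line), 100 * similar / len(line)

# Hypothetical five-residue aligned pair.
EXAMPLE_MATCH = match_line("ACDKL", "ACEGI")
EXAMPLE_PERCENTS = percent_identity_and_similarity("ACDKL", "ACEGI")
```

FOR THIS MADE-UP PAIR, TWO OF THE FIVE POSITIONS ARE IDENTICAL AND TWO MORE ARE CONSERVATIVE SUBSTITUTIONS, SO PERCENT SIMILARITY COMES OUT HIGHER THAN PERCENT IDENTITY.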
LET'S GO AHEAD JUST NOW GO TO THE ENTRY WHERE THE MORE INTERESTING STUFF ACTUALLY IS.
THIS IS NOW JUST A SUMMARY PAGE FOR MOTIF, AGAIN IN OUR CASE THIS DOMAIN IS THE P450
DOMAIN. IT STARTS OFF WITH WHAT I LIKE TO CALL AN
EXECUTIVE SUMMARY. AND WE'RE GOING TO SEE A LOT OF THESE AS WE
GO THROUGH THIS MORNING. WHAT THESE EXECUTIVE SUMMARIES REPRESENT IS THAT SOMEONE WHO KNOWS THIS
PARTICULAR ENTRY, THIS PROTEIN DOMAIN, VERY, VERY WELL, WHO KNOWS THE
LITERATURE INSIDE AND OUT, MORE THAN LIKELY AN ACTIVE RESEARCHER STUDYING THESE PROTEINS,
HAS WRITTEN FOR YOU WHAT THEY THINK THE MOST IMPORTANT THINGS YOU NEED TO KNOW ABOUT THIS
PARTICULAR PROTEIN DOMAIN ARE. ANY TIME AN EXPERT IS WILLING TO TAKE THE TIME
AND DO THAT AND SHARE THAT WITH YOU, YOU SHOULD ABSOLUTELY TAKE ADVANTAGE OF IT.
RIGHT BELOW YOU'LL SEE THAT THERE ARE REFERENCES GOING BACK TO THE PRIMARY LITERATURE, AGAIN,
EXECUTIVE SUMMARIES ARE VERY IMPORTANT, THEY'RE NOT A SUBSTITUTE FOR READING THE LITERATURE.
BUT THEY WILL AT LEAST DIRECT YOU TO WHAT ARE THE MORE IMPORTANT PAPERS TO READ IN THE
LITERATURE. THERE'S A LITTLE BIT OF JUDGMENT AS TO WHICH
PAPERS ARE IMPORTANT AND WHICH ONES MAY BE NOT SO MUCH.
ON THE RIGHT HAND SIDE WE HAVE A SAMPLE STRUCTURE. WHEN YOU DO THIS, IT WILL JUST RANDOMLY PICK
ONE OF THE STRUCTURES THAT HAS THIS PARTICULAR DOMAIN IN IT.
IF YOU WANT TO SEE DIFFERENT STRUCTURES YOU HAVE A PULL-DOWN MENU RIGHT BELOW WHERE YOU
CAN SWITCH TO OTHER STRUCTURES. NOW, WE'RE GOING TO TAKE ADVANTAGE OF TWO
LINKS AT THE VERY TOP OF THE PAGE, YOU'LL SEE IT SAYS 152 ARCHITECTURES.
AND 18,883 SEQUENCES. LET'S START WITH THE 152 ARCHITECTURES.
REMEMBER WHEN WE TALK ABOUT DOMAIN ARCHITECTURE WE'RE TALKING ABOUT THE ORDER OF DOMAIN IN
A PARTICULAR PROTEIN OR SET OF PROTEINS. THE VERY FIRST ONE IN THIS LIST TELLS US THAT
THERE ARE 16,000 PLUS SEQUENCES THAT HAVE A P450 DOMAIN IN IT.
RIGHT BELOW YOU'LL SEE BUTTON THAT SAYS, SHOW. IF YOU CLICK ON THAT BUTTON YOU CAN ACTUALLY
GET ALL OF THOSE SEQUENCES, COLLECT THOSE, STORE THEM ON YOUR HARD DRIVE AND USE THEM
FOR SOME OTHER SORT OF ANALYSIS. VERY QUICK WAY TO JUST ALL AT ONCE GET A COMPLETE
DATA SET OF, IN THIS CASE, ALL OF THE PROTEINS THAT HAVE A P450 DOMAIN.
IF YOU WANT TO GO OFF AND DO A PHYLOGENETIC ANALYSIS IT SAVES YOU THE TROUBLE OF DOING THE BLAST
SEARCH, COMPILING THEM, EDITING. AS WE GO DOWN WE'LL SEE THAT THERE ARE OTHER
ARCHITECTURES THAT ARE PART OF THIS FAMILY. RIGHT BELOW WE'VE GOT A DIFFERENT ARCHITECTURE.
P450 TIMES TWO. YOU HAVE TWO P450 DOMAINS NEXT TO EACH OTHER.
YOU'VE GOT THREE NEXT TO EACH OTHER. OTHER DOMAINS ARE MIXED IN ON SOME OF THE OTHER
LINES AS WELL. GOOD TOOL TO HAVE IN YOUR ARSENAL.
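ONE SIMPLE WAY TO THINK ABOUT A DOMAIN ARCHITECTURE IN CODE IS AS THE ORDERED LIST OF DOMAIN NAMES ALONG A PROTEIN. THE PYTHON SKETCH BELOW GROUPS PROTEINS BY ARCHITECTURE AND ALSO LOOKS UP WHICH DOMAIN, IF ANY, COVERS A GIVEN RESIDUE POSITION. THE PROTEIN IDENTIFIERS AND COORDINATES ARE INVENTED FOR ILLUSTRATION, AND THE `Domain` TYPE AND FUNCTION NAMES ARE ASSUMPTIONS, NOT ANYTHING PFAM PROVIDES.

```python
from collections import defaultdict
from typing import NamedTuple

class Domain(NamedTuple):
    name: str
    start: int   # first residue covered (1-based, inclusive)
    end: int     # last residue covered (inclusive)

def architecture(domains):
    """The nature and order of the domains, read from N- to C-terminus."""
    return tuple(d.name for d in sorted(domains, key=lambda d: d.start))

def group_by_architecture(proteins):
    """Map each architecture to the list of protein IDs that share it."""
    groups = defaultdict(list)
    for protein_id, domains in proteins.items():
        groups[architecture(domains)].append(protein_id)
    return dict(groups)

def domains_covering(domains, position):
    """Names of the domains whose coordinates include the given residue position."""
    return [d.name for d in domains if d.start <= position <= d.end]

# Hypothetical proteins and coordinates.
EXAMPLE_PROTEINS = {
    "prot1": [Domain("p450", 41, 500)],
    "prot2": [Domain("p450", 520, 980), Domain("p450", 30, 490)],
    "prot3": [Domain("p450", 10, 470)],
}
EXAMPLE_GROUPS = group_by_architecture(EXAMPLE_PROTEINS)
```

THE SAME COORDINATES ALSO LET YOU ASK WHETHER A PARTICULAR RESIDUE SITS INSIDE A DOMAIN, FOR EXAMPLE `domains_covering(EXAMPLE_PROTEINS["prot1"], 120)`.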
ESPECIALLY GOING BACK TO HOW WE STARTED OFF THIS LECTURE THINKING ABOUT WHY PROTEIN DOMAINS
ARE IMPORTANT, WHY IT'S IMPORTANT TO THINK ON THE PROTEIN SIDE OF THE HOUSE.
LET'S SAY YOU HAVE A MUTATION. THAT MUTATION MAY FALL IN ONE OF THESE DOMAINS.
HAVE YOU OBLITERATED THE DOMAIN BY THAT SIMPLE POINT MUTATION OR DELETION OR WHAT HAVE YOU
AT THE NUCLEOTIDE LEVEL? IF YOU ARE THINKING AT THE CLINICAL LEVEL,
YOU MAY HAVE A MUTATION THAT KNOCKS OUT SOME RESIDUE THAT IS IMPORTANT FOR SOME SORT OF METABOLIC
PROCESS, SOMETHING THAT IS RELEVANT IN A STRUCTURAL OR FUNCTIONAL DOMAIN, AND THAT MIGHT EXPLAIN WHY
YOU SEE THE PHENOTYPES YOU DO IN YOUR PATIENTS. SOMETHING ELSE TO THINK ABOUT AS YOU CONSIDER
THINGS AT THE PROTEIN LEVEL. THE OTHER LINK AT THE TOP, 18,000 PLUS SEQUENCES,
LET'S TAKE A LOOK AT WHAT WE SEE IF WE CLICK ON THAT.
THERE WE GO. SO, HERE WE CAN ACTUALLY SEE PREMADE ALIGNMENTS.
YOU'LL RECALL IN OUR HANDCRAFTED ANALOGY SOMEONE HAS PUT TOGETHER AN INITIAL SET OF SEQUENCES
THAT WERE FOUND TO BE MEMBERS OF THE P450 CLASS AND THAT WERE USED TO FIND ALL OF THE
OTHER MEMBERS OF THE CLASS. I COULD SEE JUST THOSE 50 IN THIS CASE THAT
WERE IN THAT SEED OR ALL OF THE SEQUENCES THAT MAKE UP THAT PARTICULAR PROTEIN FAMILY.
IN THIS CASE I'M JUST GOING TO LEAVE THIS SET SO I DON'T GET ANYTHING UNREASONABLY LONG.
IF I WANT TO LOOK AT THESE I CAN JUST CLICK ON THE VIEW BUTTON AND WHAT I WOULD GET IS
A MULTIPLE SEQUENCE ALIGNMENT THAT LOOKS LIKE THIS.
WE'RE GOING TO COME BACK TO THIS AT THE VERY END OF THE LECTURE WHEN WE TALK ABOUT THE VIEWER
THAT ALLOWS US TO MANIPULATE MULTIPLE SEQUENCE ALIGNMENTS, BUT THE COLORS MEAN SOMETHING, THE
HISTOGRAMS MEAN SOMETHING, AND I'LL TELL YOU ABOUT ALL OF THAT A LITTLE BIT LATER.
NOW LET'S SAY WE'RE SITTING AT OUR MACHINES AND PRETEND THAT WE'VE GONE BACK TO THE PFAM PAGE.
WE'RE GOING TO PRETEND THAT WE'RE SCROLLING DOWN.
WHEN WE GET TO THE BOTTOM WE SEE A NUMBER OF EXTERNAL DATABASES LINKS.
THE ONE THAT I WANT TO FOCUS ON IS THE ONE HERE WHERE IT SAYS, PROSITE.
IF I CLICK ON THAT ACCESSION NUMBER I LEAVE THE SANGER INSTITUTE AND NOW GO TO
THE SWISS INSTITUTE WHERE THEY HAVE MAINTAINED FOR MANY YEARS A DATABASE CALLED PROSITE.
THIS IS A COLLECTION OF PROTEIN PROFILES. PROFILES THAT I TOLD YOU ABOUT EARLIER THAT
JUST TELL YOU POSITION FOR POSITION WHAT AMINO ACIDS CAN EXIST AT A GIVEN POSITION THAT CHARACTERIZE
A GIVEN SET OF PROTEINS. SO, IN THIS PARTICULAR CASE FOR OUR P450 DOMAIN
WE SEE THE CONSENSUS PATTERN RIGHT THERE. NOW YOU KNOW HOW TO READ THAT, HOW TO MAKE
HEADS OR TAILS OF THAT. SO YOU CAN USE THAT AS A BASIS FOR COMPARISON.
THE MORE IMPORTANT THING I WANT TO POINT OUT ON THIS PAGE IS ONCE AGAIN YET ANOTHER EXECUTIVE
SUMMARY AND MORE IMPORTANTLY THE NAME OF THE PERSON WHO PUT THAT TOGETHER.
IF YOU NOW HAVE QUESTIONS ABOUT THIS PARTICULAR PROTEIN DOMAIN, THESE PEOPLE DO MAKE THEMSELVES
FREELY AVAILABLE TO MEMBERS OF THE BIOLOGICAL COMMUNITY.
SO YOU CAN JUST CLICK ON THEIR NAME, SEND THEM AN E-MAIL IF YOU HAVE QUESTION ABOUT
THIS PARTICULAR PROTEIN THEY WILL ANSWER THAT QUESTION FOR YOU.
IT'S ALWAYS NICE TO HAVE THAT PERSON AVAILABLE TO YOU WHEN YOU HAVE QUESTIONS THAT YOU JUST
CAN'T FIND THE ANSWERS TO IN THE LITERATURE OR FROM YOUR COLLEAGUES.
LET'S PRETEND WE'VE GONE BACKWARDS AGAIN. WE'RE ON THIS PAGE, WE CLICKED ON THIS LINK
HERE AT THE BOTTOM. FOCUS YOUR ATTENTION FURTHER DOWN WHERE IT SAYS
INTERPRO ENTRY. ANOTHER ACCESSION NUMBER.
THAT NOW TAKES US AWAY FROM PFAM OVER TO ANOTHER DATABASE AT THE SANGER INSTITUTE.
THERE WE GO. IT'S CALLED INTERPRO.
IT'S WHAT WE CALL A SECONDARY DATABASE. A SECONDARY DATABASE IS
A COLLECTION OF INFORMATION THAT IS AMASSED FROM A SERIES OF OTHER PRIMARY DATABASES.
PROSITE IS AN EXAMPLE OF ONE OF THOSE. THERE ARE OTHER DATABASES THAT TELL YOU ABOUT
PROTEIN DOMAINS AND SIMILAR CHARACTERISTICS OF PROTEINS, AND WHAT INTERPRO TRIES TO DO IS
COLLECT THOSE FOR YOU ALL IN ONE PLACE. IT'S A ONE-STOP SHOP.
AGAIN, WE ARE LOOKING AT CYTOCHROME P450. THE FIRST THING I WANT TO DRAW YOUR ATTENTION
TO HERE IS SOMETHING CALLED THE INTERPRO RELATIONSHIPS.
IN THIS SET HERE, YOU'LL SEE IT SAYS, CHILDREN. SO THE CHILDREN ARE SUBMEMBERS OF THE CLASS.
THESE ARE SUBFAMILIES THAT ARE PART OF THE BIGGER CYTOCHROME P450 FAMILY.
THEY ARE MORE SPECIFIC MEMBERS OF THE CLASS. WE'VE GOT A B CLASS, E CLASS, VARIOUS GROUPS
OF CYTOCHROME P450 PROTEINS. BECAUSE THE CHILDREN ARE ALWAYS MORE SPECIFIC
THAN THE PARENT, IF YOU HAVE A MATCH TO THE CHILD YOU HAVE A MATCH TO THE PARENT.
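THE PARENT AND CHILD RULE JUST STATED CAN BE WRITTEN DOWN IN A FEW LINES OF PYTHON. THIS IS ONLY A SKETCH OF THE LOGIC, NOT HOW INTERPRO STORES ITS RELATIONSHIPS. THE ENTRY NAMES ARE PLACEHOLDERS, THE FUNCTION NAME IS AN ASSUMPTION, AND THE HIERARCHY IS ASSUMED TO CONTAIN NO CYCLES.

```python
# Hypothetical hierarchy: each child entry points at its more general parent.
PARENT_OF = {
    "subfamily_B": "parent_family",
    "subfamily_E": "parent_family",
    "subfamily_E_group_1": "subfamily_E",
}

def implied_matches(entry, parent_of):
    """A match to an entry implies a match to every ancestor above it."""
    matches = [entry]
    while matches[-1] in parent_of:          # climb until an entry has no parent
        matches.append(parent_of[matches[-1]])
    return matches

EXAMPLE_IMPLIED = implied_matches("subfamily_E_group_1", PARENT_OF)
```

THE REVERSE IS NOT TRUE: A MATCH TO THE PARENT SAYS NOTHING ABOUT WHICH CHILD, IF ANY, YOUR SEQUENCE BELONGS TO.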
THEY BY DEFINITION HAVE TO OVERLAP. YET AGAIN WE'VE GOT ANOTHER EXECUTIVE SUMMARY
HERE AT THE BOTTOM. WE HAVE SOME ANNOTATIONS WE CAN SEE THAT THIS
PARTICULAR DOMAIN IS INVOLVED IN IRON ION BINDING, AND SO ON.
AGAIN, LET'S PRETEND WE'RE SCROLLING DOWN THE PAGE.
I'LL GET TO THIS. HERE WE HAVE A VERY FUNKY REPRESENTATION.
WHAT THIS IS INTENDED TO CONVEY TO YOU IS THE TAXONOMIC COVERAGE OF THIS PARTICULAR
DOMAIN. PUT OTHERWISE, WHAT ORGANISMS HAVE SOME PROTEIN
IN THEM THAT HAS A P450 DOMAIN. DO NOT READ ANYTHING INTO THE LENGTH
OF THE BRANCHES OR ANYTHING; THIS IS NOT A PHYLOGENETIC TREE.
IT JUST SHOWS WHICH ORGANISMS HAVE P450 DOMAINS. THE CENTER OF THIS TREE IS JUST
THE ROOT, AND AS WE GO FURTHER OUT THE INNER NODES ARE THE TREE NODES.
THE NUMBERS TELL YOU AT LEAST HOW MANY CHARACTERIZED SEQUENCES THERE ARE. THAT MAY NOT BE THE ABSOLUTE NUMBER OF SEQUENCES
BECAUSE YOU MAY HAVE SEQUENCES THAT HAVE BEEN DOCUMENTED TWO, THREE, FOUR TIMES.
IF YOU CLICK ON THOSE LINKS AGAIN YOU CAN DOWNLOAD SEQUENCE SET USE THEM FOR SOME OTHER
PURPOSE POSSIBLY DOING PHYLOGENETIC ANALYSIS OR SOMETHING ELSE.
WE NOW HAVE ANOTHER WAY OF LOOKING AT DOMAIN ARCHITECTURE.
IN THIS CASE WE'VE GOT A LITTLE BIT DIFFERENT REPRESENTATION.
IN THIS CASE WE'VE GOT A PROTEIN HAVING THIS ACCESSION NUMBER.
PROTEIN REPRESENTED BY THE RED BAR. DOMAINS ARE SHOWN UNDERNEATH.
A DIFFERENT WAY OF LOOKING AT WHAT DOMAINS COMPRISE A PARTICULAR PROTEIN.
THIS IS SOMETHING I WANT YOU TO READ; THIS IS A WORK THAT I'M EDITOR IN CHIEF OF.
WHAT I WANT YOU TO LOOK AT SPECIFICALLY ARE TWO UNITS. THE FIRST ONE IS A MUCH DEEPER
TREATMENT OF PFAM. THE OTHER ONE TALKS IN GREAT DETAIL ABOUT
INTERPRO. WHAT IS NICE ABOUT THESE UNITS, FOR THOSE OF
YOU WHO HAVE NOT SEEN THEM, IS THAT YOU CAN JUST PRINT OUT THE PAGES, SIT NEXT TO YOUR COMPUTER AND JUST STEP BY STEP FOLLOW
THE INSTRUCTIONS. IT WILL TEACH YOU HANDS-ON HOW TO DO THIS.
PUT HANDS ON KEYBOARDS START TO *** AWAY. VERY IMPORTANT THAT YOU SPEND SOME TIME TRYING
TO DO THESE TECHNIQUES IN PRACTICE. THIS IS AVAILABLE THROUGH THE N.I.H. LIBRARY FREE
OF CHARGE. SEARCH ONLINE JOURNALS, TYPE IN THE
NAME. THAT WILL TAKE YOU DIRECTLY TO THOSE LISTINGS.
NEXT IS THE CONSERVED DOMAIN DATABASE, WHICH USES A SLIGHTLY DIFFERENT METHOD.
SO, LIKE SOME OF THE OTHER THINGS THAT WE'VE SEEN, THIS IS A SECONDARY DATABASE.
WE CAN SEARCH EVERYTHING THAT'S IN PFAM, BOTH A AND B.
WE CAN SEARCH SOMETHING CALLED SMART, THE SIMPLE MODULAR ARCHITECTURE RESEARCH TOOL,
CLUSTERS OF ORTHOLOGOUS GROUPS, WHICH IS A COLLECTION OF PROTEIN FAMILIES PUT TOGETHER BY EUGENE
AT NCBI, AND OTHER RESOURCES AS WELL. NOW, IN THE INTEREST OF MAKING SURE YOU DON'T
USE THESE THINGS AS A BLACK BOX, I JUST NEED TO POINT OUT THAT THE SEARCHES HERE ARE PERFORMED
USING SOMETHING CALLED RPS-BLAST. THIS IS A VARIANT OF BLAST, REVERSE POSITION
SPECIFIC BLAST, WHERE YOUR QUERY SEQUENCE IS USED ONCE AGAIN TO SEARCH A SERIES OF THOSE POSITION
SPECIFIC SCORING MATRICES. SAME GENERAL IDEA AS WHAT I SHOWED YOU AT THE
BEGINNING OF THE LECTURE. HOWEVER, THE ACTUAL METHODOLOGY, THE ACTUAL ALGORITHM
THAT IS USED, IS DIFFERENT FROM WHAT IS USED BY PFAM. ONE OF THE TWO TOOLS WILL GIVE YOU ONE SET
OF RESULTS, THE OTHER TOOL WILL GIVE YOU AN OVERLAPPING YET SLIGHTLY DIFFERENT
SET OF RESULTS. THE TAKE HOME MESSAGE: WHEN YOU WANT TO DO THESE
KINDS OF ANALYSES TO SEE WHAT KIND OF PROTEIN DOMAINS EXIST IN YOUR PROTEIN OF INTEREST,
DO BOTH. THERE IS ALWAYS COMFORT IN CONSISTENCY BETWEEN
THE METHODS. LET'S AGAIN PRETEND WE'RE AT THE KEYBOARDS
AND TAKE YOU THROUGH EXAMPLE. THIS IS THE CDD HOME PAGE, THERE'S VERY LONG
URL HERE. ACTUALLY EASIER TO JUST GO TO THE NCBI AND
LOOK FOR STRUCTURE LINK OFF OF THE HOME PAGE. IF YOU WANT TO TYPE THAT IN, THERE IT IS.
HERE, WE DON'T HAVE ANY OPTIONS WE CAN SET. JUST A BOX.
AND WE HAVE A CHOICE OF WHICH DATABASES WE CAN SEARCH.
FROM THE PREVIOUS SLIDE WE CAN SEARCH ANY ONE OF THOSE INDIVIDUAL DATABASES.
OR WE CAN JUST SEARCH THEM ALL AT ONCE SO THAT'S WHAT WE'RE GOING TO DO JUST LEAVE IT
AT THE DEFAULT. THE SEQUENCE WE'RE USING HERE IS THE DELETED IN COLORECTAL
CARCINOMA PROTEIN SEQUENCE FROM HUMAN.
LET'S SAY WE PUT THAT IN THE BOX, CLICKED ON SUBMIT.
THIS IS WHAT WE GET BACK. IT'S A LITTLE BIT REMINISCENT OF THE BLAST
RESULTS WE SAW LAST WEEK IN THAT WE HAVE A REPRESENTATION AT THE TOP OF OUR PROTEIN HITS.
OUR PROTEIN IS THIS, GOING FROM ONE TO 1447, AND YOU'LL SEE BELOW, EACH ONE OF THESE
BOXES REPRESENTS ONE OF THE DOMAINS THAT WAS FOUND IN THIS PROTEIN.
THERE IS A DESCRIPTION OF WHAT THAT PARTICULAR DOMAIN IS,
AN IDENTIFIER, WHICH WE'LL COME BACK TO, AND THE PROBABILITY VALUE.
AGAIN, SAME GUIDELINES AS LAST WEEK APPLY TEN TO THE MINUS THIRD, BECAUSE WE'RE DOING
A COMPARISON AT THE PROTEIN LEVEL. IF WE WANT TO SEE A LITTLE BIT MORE ABOUT
THIS FIRST HIT, ALL I HAVE TO DO IS CLICK ON THAT PLUS SIGN.
THAT WILL ACTUALLY EXPAND THIS OUT, I CAN ACTUALLY NOW SEE AN ALIGNMENT.
SO, IN THIS ALIGNMENT I JUST HAVE A SENSE OF HOW WELL MY SEQUENCE MATCHES THIS PARTICULAR DOMAIN
CONSENSUS. MY SEQUENCE, HUMAN D.C.C., IS IN THE FIRST LINE.
IN MY SEQUENCE POSITIONS 41 TO 136 LINED UP WITH THE DOMAIN EVERY PLACE YOU SEE POSITION
MARKED IN RED THAT IS EXACT MATCH. EVERYTHING ELSE IS EITHER A CONSERVATIVE SUBSTITUTION
OR A MISMATCH. SO, JUST AT A GLANCE YOU CAN SEE MOST OF THE
POSITIONS ARE IN RED, AND THAT BODES WELL FOR BEING A TRUE POSITIVE. AGAIN, HERE IS THE PROBABILITY
VALUE WHICH GIVES US THE QUANTITATIVE MEASURE OF HOW GOOD OUR HIT IN THIS CASE IS. IF I WANT
TO LEARN A LITTLE BIT MORE ABOUT THIS PARTICULAR NEOGENE DOMAIN THAT WE FOUND IN OUR PROTEIN
OF INTEREST THAT IS WHERE THIS NUMBER COMES INTO PLAY.
THIS ACCESSION NUMBER IS IN THE COLUMN THAT IS LABELED PSSM, POSITION SPECIFIC SCORING MATRIX
I.D. IF I CLICK ON THAT I.D. THAT TAKES ME TO AN
EXPANDED VIEW. SOMETHING CALLED THE CONSERVED DOMAIN DATABASE
AT NCBI. AGAIN, A QUICK SUMMARY OF WHAT IS KNOWN ABOUT
THIS PARTICULAR DOMAIN. THE REFERENCES THAT SUPPORT THAT PARTICULAR
DESCRIPTION SO I CAN GO BACK TO WHAT ARE DEEMED TO BE THE MORE IMPORTANT PAPERS ON THIS PARTICULAR
DOMAIN. IF I SCROLL DOWN A LITTLE BIT, WE HAVE A REPRESENTATION
OF HOW THIS DOMAIN RELATES TO OTHER DOMAINS.
THERE IS A HIERARCHY HERE. I'M JUST GOING TO BYPASS THAT; I DON'T FIND THAT
TO BE PARTICULARLY USEFUL, I JUST WANT TO POINT IT OUT TO YOU.
BUT MOST IMPORTANTLY, AT THE BOTTOM IS THE SEQUENCE ALIGNMENT.
VERY QUICKLY YOU CAN SEE WHAT THIS ALIGNMENT LOOKS LIKE, WHO THE OTHER MEMBERS ARE OF THIS
PARTICULAR CLASS. ONCE AGAIN YOU CAN DOWNLOAD THESE SEQUENCES
TO USE FOR SOME OTHER PURPOSE IN SOME THIRD PARTY SOFTWARE. THERE'S A LINK OFF OF THIS
PAGE THAT DESCRIBES TO YOU HOW TO ACTUALLY DO THAT.
SO, WITH THAT, I WANT TO NOW FLIP THE ANALOGY AROUND.
AT THE BEGINNING WE SAID WE CAN DO OUR SEARCHES ONE TO MANY.
IN THIS CASE THE ONE WAS OUR SEQUENCE OF INTEREST. THE MANY WAS EITHER OUR PROFILES OR OUR PATTERNS.
LET'S FLIP THAT AROUND. AND NOW WHAT WE'RE ACTUALLY GOING TO DO IS
CONSTRUCT A PROFILE TO ENABLE US TO FIND DISTANTLY RELATED PROTEINS RELATED TO THE ONE THAT WE
START WITH, THE ONE THAT WE'RE INTERESTED IN.
THE WAY WE'RE GOING TO DO THAT IS BY USING A TOOL CALLED PSI-BLAST, POSITION SPECIFIC ITERATED BLAST.
THIS IS INCREDIBLY EASY TO USE. PSI-BLAST STARTS THE EXACT SAME WAY WE DID LAST WEEK.
TAKE THE SEQUENCE OF INTEREST, SET THE VARIOUS PARAMETERS THAT WE WANT TO
CHANGE, CLICK ON THE GO BUTTON, AND IT WILL JUST DO THAT BLASTP SEARCH.
ONCE WE GET BACK OUR LIST OF HITS, IT IS GOING TO TAKE THAT HIT LIST AND EVERYTHING THAT IS
ABOVE THE PROBABILITY THRESHOLD THAT WE'RE GOING TO SET. IT WILL TAKE ALL OF THOSE SEQUENCES
IN THE HIT LIST, CONSTRUCT A MULTIPLE SEQUENCE ALIGNMENT, DERIVE THE POSITION SPECIFIC SCORING
MATRIX AND NOW USE THAT MATRIX AS THE INPUT FOR THE NEXT ROUND OF SEARCHES.
SO, WE STARTED WITH A SINGLE SEQUENCE, WE ENDED UP WITH A POSITION SPECIFIC SCORING
MATRIX, AND WE THROW AWAY THAT INITIAL SEQUENCE. AND AGAIN WE JUST USE THESE MATRICES, WHICH WILL
BE RECALCULATED ROUND AFTER ROUND UNTIL WE'VE IDENTIFIED ALL OF THE MEMBERS OF THAT PARTICULAR
CLASS. HOPEFULLY WE WILL COME TO CONVERGENCE, WHERE
ALL RELATED SEQUENCES ARE DEEMED FOUND. IF WE KEEP GOING ROUND AFTER ROUND AND THE NUMBERS KEEP
GETTING BIGGER AND BIGGER, AT SOME POINT YOU WILL HAVE BROUGHT ALL OF GENBANK IN TO YOUR
QUERY. YOUR QUERY IS THEN DEEMED TO BE A LITTLE BIT TOO
BROAD. SO WHAT YOU HAVE TO DO IS JUST USE A SHORTER
REGION OR MAKE YOUR CUTOFF A LITTLE BIT MORE STRINGENT.
THAT RARELY HAPPENS, BUT IT IS SOMETHING TO BE MINDFUL OF.
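TO MAKE THE ROUND-AFTER-ROUND IDEA CONCRETE, HERE IS A MINIMAL PYTHON SKETCH OF THE LOOP. IT IS A TOY, NOT THE PSI-BLAST PROGRAM: IT ASSUMES UNGAPPED SEQUENCES OF EQUAL LENGTH, A UNIFORM BACKGROUND, AND A RAW SCORE CUTOFF IN PLACE OF AN E VALUE. THE NAMES build_pssm, score AND iterate_search ARE MADE UP FOR ILLUSTRATION, AND THE QUERY STAYS IN AS ONE ROW OF THE PROFILE.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # toy assumption: uniform background frequency


def build_pssm(aligned, pseudocount=1.0):
    """Log-odds score per position from equal-length, ungapped sequences."""
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / BACKGROUND)
            for aa in ALPHABET
        })
    return pssm


def score(pssm, seq):
    """Sum the position specific scores along one candidate sequence."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


def iterate_search(query, database, threshold, max_rounds=20):
    """Rebuild the matrix from the included hits until no new hits appear."""
    included = [query]
    for round_number in range(1, max_rounds + 1):
        pssm = build_pssm(included)
        hits = [s for s in database if score(pssm, s) >= threshold]
        new_hits = [s for s in hits if s not in included]
        if not new_hits:  # convergence: nothing new above the cutoff
            return included, round_number
        included.extend(new_hits)
    return included, max_rounds


# Toy database: the third entry only scores well once the second is in the profile.
found, rounds = iterate_search("ACDE", ["ACDF", "ACGF", "WWWW"], threshold=2.5)
print(found, rounds)
```

THE POINT OF THE TOY DATA IS THAT ONE SEQUENCE IS MISSED BY THE SINGLE-SEQUENCE SEARCH AND ONLY PICKED UP AFTER THE MATRIX HAS BEEN REBUILT FROM THE FIRST ROUND OF HITS.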
HOW DO WE DO THIS. HOPEFULLY THIS NOW LOOKS FAMILIAR TO YOU,
AGAIN HERE IS THE URL THAT TAKES YOU TO THE BLAST HOME PAGE.
OUR PROTEIN-BASED SEARCHES ARE HERE. UNDER THE PROTEIN BLAST LINK, AGAIN, WE WILL
CLICK ON THAT AS WE DID LAST WEEK. AND THIS WILL TAKE US NOW TO THE BLAST HOME
PAGE. AS WE DID LAST WEEK, WE'RE GOING TO PASTE
THE SEQUENCE IN TO THE BOX, IN THIS CASE THIS IS A DNA BINDING PROTEIN, HIGH MOBILITY GROUP
PROTEIN THAT WE'RE LOOKING FOR. AND LET'S TAKE A LOOK AT SOME OF OUR OPTIONS.
THE FIRST THING WE GET TO DO IS CHOOSE WHAT DATABASE WE WANT.
THE SAME OPTIONS ARE AVAILABLE AS LAST WEEK, SO WE COULD PICK REFSEQ, WHICH WAS THE CURATED
DATABASE I TOLD YOU ABOUT LAST TIME AROUND. I ALSO ALLUDED TO SWISS-PROT, SO LET ME TELL YOU ABOUT
THAT THIS TIME AROUND.
THE SAME WAY THAT REFSEQ IS INTENDED TO REPRESENT EACH MOLECULE ONCE AND ONLY ONCE,
TO ONLY HAVE ONE ENTRY FOR EACH DNA SEQUENCE AND PROTEIN SEQUENCE, SWISS-PROT IS INTENDED TO
DO THE SAME THING BUT ONLY ON THE PROTEIN SIDE.
ONLY A COLLECTION OF PROTEIN SEQUENCES. THIS IS A LONG STANDING, 30-PLUS YEAR EFFORT
THAT HAS BEEN GOING ON AT THE SWISS INSTITUTE FOR BIOINFORMATICS.
WHAT IS NICE ABOUT THIS IS THAT BY DEFINITION OF COURSE THESE ARE NONREDUNDANT.
THERE'S INTEGRATION WITH OTHER DATABASES. THERE'S ONGOING ENTRIES BY EXTERNAL EXPERTS.
THIS REALLY RELIES ON ACTIVE EXPERIMENTALISTS IN THE FIELD TO KEEP THESE ENTRIES UP
TO DATE, AND MORE IMPORTANTLY WHEN YOU LOOK AT THE FEATURE TABLES IN THESE ENTRIES YOU'LL
SEE A BUNCH OF COMMENT LINES, YET ANOTHER EXECUTIVE SUMMARY BY THE ACTIVE
INVESTIGATORS IN THAT FIELD. AGAIN, DO TAKE ADVANTAGE OF THOSE RESOURCES
WHENEVER THEY'RE AVAILABLE TO YOU. WHEN YOU LOOK AT THAT ACCESSION NUMBER,
WHICH IS THE UNIQUE IDENTIFIER, THE SEQUENCE'S SOCIAL SECURITY NUMBER, IN THE FIRST POSITION
YOU WILL SEE EITHER AN O OR P OR Q FOLLOWED BY FIVE NUMBERS.
WHEN YOU SEE THAT, AN O, P, OR Q FOLLOWED BY FIVE DIGITS,
THAT IS A SWISS-PROT ENTRY. LET'S GO BACK TO THE SEARCH.
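THE ACCESSION RULE JUST STATED CAN BE WRITTEN AS A SHORT PATTERN CHECK. THIS IS A SKETCH THAT ENCODES ONLY THE RULE AS GIVEN IN THE LECTURE (O, P, OR Q FOLLOWED BY FIVE DIGITS). ACCESSIONS IN CURRENT RELEASES CAN ALSO CARRY LETTERS IN SOME OF THOSE POSITIONS, SO TREAT IT AS A ROUGH SCREEN. THE FUNCTION NAME AND THE EXAMPLE STRINGS ARE MADE UP FOR ILLUSTRATION.

```python
import re

# Rule exactly as stated in the lecture: O, P, or Q followed by five digits.
LECTURE_RULE = re.compile(r"[OPQ][0-9]{5}")


def looks_like_swissprot(accession):
    """Return True when the string fits the O/P/Q-plus-five-digits form."""
    return LECTURE_RULE.fullmatch(accession.strip().upper()) is not None


for candidate in ["P12345", "q00001", "NP_000001", "Q1234"]:
    print(candidate, looks_like_swissprot(candidate))
```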
I SELECTED SWISS-PROT HERE. THE REASON IS SIMPLE: THIS IS A NONREDUNDANT
DATABASE, SO WE'LL GET A NICE TIDY LIST OF RESULTS BACK.
WE WON'T HAVE MULTIPLE HITS ON THE SAME SEQUENCE. ALL RIGHT.
AS BEFORE, THE ALGORITHM PARAMETERS ARE HIDDEN BELOW THE BLAST BUTTON. SO THE VERY FIRST THING WE HAVE IS HOW MANY
TARGET SEQUENCES DO WE WANT TO HAVE RETURNED BACK TO US.
THE DEFAULT IS 500. I CAN SET THAT TO ANY NUMBER, SO LET ME SET IT TO A THOUSAND.
YOU'LL RECALL IN LAST WEEK'S EXAMPLE, IF WE LEFT THE DEFAULT WE WOULD HAVE MISSED THINGS IN OUR HIT
LIST. JUST SET THAT NUMBER AS HIGH AS YOU CAN.
THE EXPECTATION THRESHOLD, THE E VALUE, IS OUR MEASURE OF WHETHER SOMETHING IS A FALSE POSITIVE.
THE DEFAULT IS TEN. I'M GOING TO GO AHEAD AND USE MY GUIDELINE OF TEN TO THE MINUS THIRD, OR
.001. AGAIN, REMEMBER THOSE ARE GUIDELINES, THOSE
ARE NOT ABSOLUTES. WE'RE JUST GOING TO USE THAT AS A STARTING
POINT HERE. WE SHOULD AGAIN FILTER THE LOW COMPLEXITY REGIONS.
THESE ARE THE HOMOPOLYMERIC RUNS WHERE YOU HAVE STRETCHES OF THE SAME LETTER. THOSE TEND TO
CONFOUND THE BLAST SEARCHES. WE WANT TO JUST MASK THOSE AND NOT CONSIDER
THOSE AS PART OF THE SEQUENCE SEARCHES. FINALLY AT THE BOTTOM, SOMETHING WE DIDN'T
TALK ABOUT LAST WEEK, WE NOW HAVE A SECTION OF THIS PAGE THAT IS SPECIFICALLY GEARED TOWARDS
THE PSI-BLAST SEARCHES AND SOMETHING ELSE CALLED PHI-BLAST, WHICH WE'RE NOT GOING TO TALK
ABOUT TODAY. LET'S SAY YOU HAD A SCORING MATRIX AND YOU WANT
TO USE THAT AS INPUT. YOU CAN JUST UPLOAD THAT. IF YOU FOUND IT MAYBE
ON SOME OTHER DATABASE YOU CAN USE THAT AS INPUT FOR YOUR FIRST ROUND.
WE DON'T HAVE THAT, SO WE'RE JUST GOING TO GO AHEAD AND USE OUR ONE SEQUENCE OF INTEREST.
I SET THE PSI-BLAST THRESHOLD AT .001 AS WELL. THE DEFAULT HERE IS .005.
THERE MIGHT BE VIRTUE TO MAKING THAT A LITTLE BIT HIGHER BECAUSE, AS WITH LAST WEEK, WE WANT
TO TAKE A LOOK AT WHAT MIGHT BE ON EITHER SIDE OF OUR CUTOFF LINE, USING OUR BIOLOGICAL
KNOWLEDGE TO SAY WHICH ONES DO I WANT TO INCLUDE. IT'S MY PERSONAL PREFERENCE, TO MAKE THINGS LESS
UNWIELDY. OFF WE GO.
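FOR ANYONE WHO WOULD RATHER SCRIPT THESE SETTINGS THAN CLICK THROUGH THE WEB PAGE, THE SKETCH BELOW ASSEMBLES AN EQUIVALENT COMMAND LINE. IT ASSUMES THE STANDALONE NCBI BLAST+ psiblast PROGRAM, WHICH IS NOT WHAT IS SHOWN IN THE LECTURE. THE QUERY FILE NAME AND DATABASE NAME ARE HYPOTHETICAL, AND THE OPTION NAMES SHOULD BE CHECKED AGAINST THE HELP TEXT OF YOUR INSTALLED VERSION. THE CODE ONLY BUILDS THE ARGUMENT LIST, IT DOES NOT RUN ANYTHING.

```python
def psiblast_command(query_fasta, database, rounds):
    """Mirror the settings chosen on the web form as command line arguments."""
    return [
        "psiblast",
        "-query", query_fasta,            # hypothetical FASTA file with the one sequence
        "-db", database,                  # hypothetical local protein database name
        "-max_target_seqs", "1000",       # raised from the web default of 500
        "-evalue", "0.001",               # expectation threshold, ten to the minus third
        "-inclusion_ethresh", "0.001",    # cutoff for including a hit in the next matrix
        "-seg", "yes",                    # mask low complexity regions in the query
        "-num_iterations", str(rounds),   # upper limit on the number of rounds
    ]


print(" ".join(psiblast_command("hmg_query.fasta", "swissprot", 11)))
```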
NOW, THIS LOOKS LIKE WHAT WE HAVE SEEN AGAIN LAST WEEK.
ONE TO 215 IN THIS PARTICULAR CASE. FOUND TWO MATCHES.
THERE ARE TWO HMG BOXES IN THIS PARTICULAR PROTEIN.
AND YOU'LL SEE ONCE AGAIN OUR VERY EASY SCORING TABLE THAT WE TALKED ABOUT LAST WEEK,
SHOWING A GRAPHICAL REPRESENTATION OF ALL OF THE HITS THAT WERE FOUND BASED ON OUR INITIAL
QUERY. IF I NOW SCROLL DOWN TO WHERE THE DESCRIPTIONS
ARE, LET ME JUST REMIND YOU HOW THIS IS ORIENTED.
YOU'LL ALWAYS SEE THE ACCESSION NUMBER AT THE BEGINNING, WHICH IS HYPERLINKED.
THEN YOU HAVE A SHORT DESCRIPTION OF WHAT THAT PARTICULAR PROTEIN THAT WAS FOUND ACTUALLY
IS, THEN THE SCORE FOR THAT. IF YOU CLICK ON THAT, THAT WILL TAKE YOU DOWN TO THE ALIGNMENT.
REMEMBER THE SCORES ARE LESS IMPORTANT HERE. THE PROBABILITY VALUES ARE WHAT I WANT
YOU TO ALWAYS LOOK AT, IN THE E VALUE COLUMN.
NOW, AT THE VERY BEGINNING OF EACH LINE IS THE WORD "NEW." YOU MIGHT SEE ONE OF TWO THINGS IN THAT
POSITION, AND THE KEY IS HERE AT THE TOP. IF YOU SEE "NEW," THE ALIGNMENT SCORE WAS BELOW THRESHOLD
ON THE PREVIOUS ITERATION. WE DIDN'T HAVE ONE, SO EVERYTHING IS MARKED
"NEW." BUT AS WE GO THROUGH THE SUCCESSIVE ROUNDS
HERE, ANYTHING THAT IS CARRIED OVER FROM THE PREVIOUS
ROUND WILL BE MARKED WITH A GREEN DOT. ANYTHING THAT WAS FOUND THAT IS NEW IN THE
NEW ROUNDS WILL THEN HAVE THE NEW LABEL NEXT TO THEM.
LET'S MOVE DOWN TO THE BOTTOM OF OUR HIT LIST HERE.
IT SAYS HERE, RUN PSI-BLAST ITERATION, WITH A MAXIMUM OF, ONCE AGAIN, A THOUSAND SEQUENCES RETURNED.
WHAT WILL NOW HAPPEN, JUST TO REMIND YOU, IS WE THROW OUT OUR INITIAL QUERY.
WE HAVE A POSITION SPECIFIC SCORING MATRIX CALCULATED BASED ON THE THINGS IN THIS LIST.
YOU CAN INCLUDE ALL OF THE THINGS IN THE LIST, HENCE ALL OF THE CHECK BOXES.
BUT AGAIN HERE IS WHERE YOUR BIOLOGICAL KNOWLEDGE COMES IN TO PLAY.
MAYBE YOU SEE THINGS ON TOWARDS THE BOTTOM OF THIS LIST THAT DO NOT BELONG.
IF THEY DON'T, BECAUSE YOU HAVE SOME OTHER PIECE OF INFORMATION UNCHECK THE BOX.
THAT WAY IT WILL NOT BE INCLUDED IN THE NEXT ROUND.
GO AHEAD CLICK ON GO. THIS WILL GO AROUND AS MANY TIMES AS IT HAS
TO. YOU WILL JUST HAVE TO WAIT FOR IT AND YOU
JUST KEEP CLICKING ON "GO" AND YOU GO TO THREE, FOUR, FIVE.
IN THIS CASE YOU GO AROUND 11 TIMES. UNTIL YOU FINALLY REACH CONVERGENCE.
AND YOU KNOW YOU HAVE REACHED CONVERGENCE WHEN YOU SEE THE MESSAGE AT THE TOP.
NO NEW SEQUENCES WERE FOUND ABOVE THE .001 THRESHOLD.
AT THIS POINT WE CAN HAVE CONFIDENCE THAT WE HAVE FOUND ALL OF THE MEMBERS OF THIS,
IN THIS CASE, HIGH MOBILITY GROUP CLASS OF PROTEINS THAT WE CAN FIND USING THIS PARTICULAR
METHOD. TO DRIVE HOME WHY THIS IS A POWERFUL TECHNIQUE
THAT YOU SHOULD HAVE IN YOUR ARSENAL: IF YOU RECALL, IN ROUND ONE, I THINK I POINTED
THIS OUT, WE HAD 132 HITS. BY THE TIME WE GOT TO ROUND 11 WE HAD 180
HITS. WE FOUND 48 ADDITIONAL SEQUENCES THAT WE WOULD
NOT HAVE FOUND JUST BY USING OUR TRADITIONAL BLAST SEARCH.
SO, VERY IMPORTANT, ESPECIALLY IF YOU'RE DEALING WITH PROTEINS THAT HAVE NOT BEEN HIGHLY CONSERVED
OVER EVOLUTIONARY TIME. THE THINGS WHERE THERE IS A LOT OF EVOLUTIONARY
PRESSURE TO NOT HAVE MUTATIONS, NOT HAVE CHANGES, YOU'RE NOT GOING TO REALLY PICK UP ANYTHING
BY USING PSI-BLAST. BUT FOR MOST CLASSES OF PROTEINS THERE IS ENOUGH
WIGGLE ROOM THAT IT IS WORTH TAKING THE TIME TO DO THIS EXTRA SET OF SEARCHES
TO SEE WHAT ELSE YOU CAN FIND. THIS HOPEFULLY DEMONSTRATES NOW THE POWER
OF USING THE COLLECTIVE CHARACTERISTICS OF THE PROTEIN FAMILY TO FIND THINGS.
SO WITH THAT, WE'RE GOING TO LEAVE THINGS AT THE SEQUENCE LEVEL BEHIND FOR NOW.
WE'RE GOING TO MOVE IN
TO STRUCTURAL THINGS. I THINK WE ALL HAVE IN THE BACK OF OUR HEAD,
THE IMAGE OF THE GEEK DOWN THE HALL WITH THE COKE BOTTLE GLASSES AND BIG MACHINES, AND JUST
STRUCTURE BEING SOMETHING IMPENETRABLE, SOMETHING HARD TO UNDERSTAND BECAUSE OF THE
TECHNOLOGY THAT IS INVOLVED IN GENERATING THOSE THREE DIMENSIONAL STRUCTURES, TRYING
TO FIGURE OUT WHAT FOURIER TRANSFORMS ARE, ALL OF THAT.
WHAT I WANT TO SHOW YOU THIS MORNING IS THAT YOU DON'T HAVE TO CONCERN YOURSELVES WITH HOW
WE GOT THE ACTUAL STRUCTURES, BUT THERE ARE SOME VERY EASY TO USE TOOLS THAT YOU CAN USE
TO NOW ANSWER QUESTIONS ABOUT STRUCTURAL SIMILARITIES. AND THE REASON I WANT TO MAKE SURE THAT YOU
KNOW ABOUT THESE TOOLS IS BECAUSE OF THE VERY BASIC TENET THAT CHRIS ANFINSEN, BACK IN THE 1950S,
WON THE NOBEL PRIZE FOR: SEQUENCE SPECIFIES CONFORMATION.
THIS WAS BATTERED INTO ALL OF OUR HEADS IN BASIC BIOCHEMISTRY.
BUT CONFORMATION DOES NOT SPECIFY SEQUENCE. THE CONVERSE OF THAT STATEMENT IS NOT TRUE.
SO, YOU MIGHT HAVE MULTIPLE STRUCTURES WHERE YOU SEE SIMILARITY AT THE STRUCTURAL LEVEL.
WHETHER IT IS IN A DOMAIN OR ACROSS THE ENTIRE PROTEIN.
BUT IF YOU LOOK AT THE UNDERLYING SEQUENCES THAT MAKE UP THAT PROTEIN, YOU MIGHT SEE VERY,
VERY LITTLE SEQUENCE SIMILARITY. THERE ARE CASES IN THE LITERATURE WHERE THAT
PERCENT IDENTITY GOES DOWN TO THE TEN, 11, 12% RANGE.
WHY IS THAT IMPORTANT TO YOU? WHEN YOU DO YOUR BLAST SEARCHES, BLAST TENDS
TO START TO FAIL BELOW 25% SEQUENCE IDENTITY. THIS IS VERY WELL DOCUMENTED IN THE LITERATURE.
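AS A CONCRETE HANDLE ON THE 25% FIGURE, HERE IS A SMALL SKETCH THAT COMPUTES PERCENT IDENTITY FROM TWO ALREADY-ALIGNED SEQUENCES AND FLAGS PAIRS THAT FALL BELOW THE GUIDELINE. IT ASSUMES ONE PARTICULAR CONVENTION (IDENTICAL POSITIONS DIVIDED BY THE NUMBER OF COLUMNS WHERE NEITHER SEQUENCE HAS A GAP). OTHER TOOLS USE OTHER DENOMINATORS. THE FUNCTION NAMES AND THE EXAMPLE STRINGS ARE MADE UP FOR ILLUSTRATION.

```python
def percent_identity(aligned_a, aligned_b):
    """Identical positions as a percentage of the ungapped aligned columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def below_blast_guideline(identity, cutoff=25.0):
    """True when the pair sits under the 25% guideline from the lecture."""
    return cutoff > identity


identity = percent_identity("ACDEFGHK", "ACD-WWWW")
print(identity, below_blast_guideline(identity))
```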
AS YOU RECALL, LAST WEEK I GAVE YOU THE 25% RULE AS ONE OF YOUR CRITERIA TO USE FOR DETERMINING
BIOLOGICAL SIGNIFICANCE OF YOUR BLAST HITS. WHEN YOU START TO ENTER THAT TERRITORY WHERE BLAST
IS NO LONGER THE TOOL OF CHOICE, THIS IS THE THING THAT YOU SHOULD HAVE IN YOUR ARSENAL,
READY TO GO, TO FIND THINGS THAT, AS I PUT HERE ON THE SLIDE, CANNOT NECESSARILY BE DETECTED
THROUGH TRADITIONAL METHODS. AGAIN, A LITTLE BIT OF BACKGROUND.
THIS IS COOL. WHAT WE'RE GOING TO DO NOW IS COMPARE EVERY
KNOWN PROTEIN STRUCTURE TO EVERY OTHER KNOWN PROTEIN STRUCTURE. TO GIVE YOU A SENSE OF
THE SIZE OF THAT, AS OF THIS MORNING, THERE ARE 62,000 ENTRIES IN THE PROTEIN DATA BANK.
58,000 OF THEM REPRESENT PROTEIN STRUCTURES DONE EITHER BY NMR OR X-RAY.
NOW, TO DO A COMPARISON OF ONE STRUCTURE TO ANOTHER STRUCTURE USING THE MOST ROBUST METHODS
THAT WE HAVE TAKES ON THE ORDER OF WEEKS TO MONTHS, DEPENDING ON HOW MUCH COMPUTING YOU HAVE.
NOW, IF YOU MULTIPLY THAT OUT, YOU HAVE SOMETHING THAT IS NOW COMPUTATIONALLY INTRACTABLE.
IN ORDER TO GET AROUND THAT, THE WAY WE'RE GOING TO DO THIS, AND THIS IS
DONE FOR YOU, IS AS FOLLOWS, JUST SO YOU UNDERSTAND WHAT IS GOING ON.
WE'RE GOING TO TAKE EACH ONE OF THESE 58,000 PROTEIN STRUCTURES, SO HERE IS JUST AN EXAMPLE
OF A PARTICULAR STRUCTURE. WE SEE A BUNCH OF BLUE BITS SO THE BLUE BITS
HERE ARE THE ALPHA HELICES. THE GREEN LINES REPRESENT THE BETA STRANDS.
WHAT WE'RE GOING TO DO IN THE FIRST STEP IS GET RID OF ALL OF THE LOOP REGIONS, ANYTHING
THAT DOES NOT EXIST EITHER IN ALPHA HELIX OR IN A BETA STRAND.
THAT IS WHAT WE HAVE IN THE SECOND PART OF THE PICTURE HERE.
WHAT THE METHOD THEN DOES IS, FOR EVERY ALPHA HELIX, IT PASSES A VECTOR RIGHT THROUGH THE
CENTER TO APPROXIMATE THE PATH OF THAT HELIX, KEEPING IN MIND WHICH END IS THE N-TERMINAL
END AND WHICH END IS THE C-TERMINAL END.
IF IT SEES A BETA STRAND IT WILL JUST DO THE SAME THING, DRAW A VECTOR THAT APPROXIMATES THE
PATH OF THAT BETA STRAND. ONCE THAT IS DONE, ALL THAT COORDINATE INFORMATION IS
THROWN AWAY. BY THE TIME WE GET TO THE LAST STEP,
EVERY SINGLE ATOMIC COORDINATE HAS BEEN THROWN OUT.
BUT BASED ON THOSE COORDINATES WE NOW HAVE A SERIES OF VECTORS.
FOR EACH ONE OF THOSE VECTORS WE KNOW WHICH ONE IS AN ALPHA HELIX, WHICH ONE IS A BETA STRAND,
WHICH END IS THE N-TERMINAL END, WHICH IS THE C-TERMINAL END, AND
WHICH ONE CONNECTS TO THE NEXT ONE IN THE SERIES.
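THE REDUCTION FROM COORDINATES TO VECTORS CAN BE SKETCHED IN A FEW LINES. THIS IS A SIMPLIFIED STAND-IN, NOT THE ACTUAL FITTING PROCEDURE USED BY THE SERVER: IT ASSUMES EACH HELIX OR STRAND IS GIVEN AS A LIST OF 3D POINTS ORDERED FROM THE N-TERMINAL END TO THE C-TERMINAL END, AND IT APPROXIMATES THE PATH BY JOINING THE AVERAGE OF THE FIRST FEW POINTS TO THE AVERAGE OF THE LAST FEW. THE CLASS AND FUNCTION NAMES ARE MADE UP FOR ILLUSTRATION.

```python
from dataclasses import dataclass


@dataclass
class ElementVector:
    kind: str     # "helix" or "strand"
    start: tuple  # approximate N-terminal end of the element
    end: tuple    # approximate C-terminal end of the element


def centroid(points):
    """Average position of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def to_vector(kind, coords, window=4):
    """Replace an element's coordinates with one N-to-C vector."""
    w = min(window, len(coords))
    return ElementVector(kind, centroid(coords[:w]), centroid(coords[-w:]))


# Toy helix running straight up the z axis; only the vector is kept afterwards.
helix_coords = [(0, 0, z) for z in range(10)]
print(to_vector("helix", helix_coords))
```

ONCE EVERY ELEMENT HAS BEEN REDUCED THIS WAY, THE ORIGINAL COORDINATE LISTS ARE NO LONGER NEEDED FOR THE COMPARISON STEP.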
THAT IS NOW GOING TO BE THE BASIS FOR OUR COMPARISON, AND THIS NOW TURNS IN TO A GLORIFIED
GAME OF PICK-UP STICKS. SO, WE NOW HAVE HERE FOR THE EXAMPLE TWO PROTEIN
STRUCTURES, PROTEIN ONE AND PROTEIN TWO. FIRST ONE HAS FOUR SECONDARY STRUCTURAL ELEMENTS,
THE SECOND ONE HAS FIVE. FOR ARGUMENT'S SAKE WE'LL SAY THEY'RE ALL ALPHA
HELICES. NOW WE'RE GOING TO OVERLAY THESE EVERY WAY WE CAN
TO FIND OUT WHETHER OR NOT THESE ARE STRUCTURALLY SIMILAR TO EACH OTHER.
IN THE FIRST ALIGNMENT, WE MIGHT TAKE ALL FOUR OF THESE SECONDARY STRUCTURAL ELEMENTS FROM PROTEIN ONE,
OVERLAY THEM WITH THE FIRST FOUR SECONDARY STRUCTURAL ELEMENTS OF PROTEIN TWO, AND WHAT WE SEE IS THAT THEY'RE ALL
GOING PRETTY MUCH ON TOP OF EACH OTHER AND ALL GOING IN THE SAME DIRECTION.
OF COURSE THIS IS DONE MUCH MORE ROBUSTLY THAN THIS.
IT'S NOT DONE BY EYE, THERE'S ACTUALLY MATHEMATICS BEHIND IT.
I THINK YOU GET THE IDEA BASICALLY IF WE SEE SOMETHING LIKE THIS WE WOULD DEEM THAT TO
BE A GOOD MATCH OF THE TWO STRUCTURES OVER THAT REGION TO ONE ANOTHER.
LET'S TAKE ANOTHER ALIGNMENT WHERE YOU MIGHT TAKE ALL FOUR OF THESE ALPHA HELICES FROM
PROTEIN ONE AND COMBINE THOSE WITH ONE, TWO, THREE AND FIVE.
FROM PROTEIN TWO. AS BEFORE, ONE TWO, AND THREE ALL GOING THE
SAME WAY. PRETTY MUCH THE SAME PATH.
FIVE IS OFF DOING ITS OWN THING. WE WOULD NOT CONSIDER THAT TO BE A GOOD ALIGNMENT.
THAT IS JUST DONE OVER AND OVER AGAIN FOR EVERY POSSIBLE COMBINATION THAT THE COMPUTER CAN
COME UP WITH. IT DOES THE MATH IN THE BACKGROUND.
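THE "TRY EVERY COMBINATION" STEP CAN BE ILLUSTRATED WITH A TOY SCORING LOOP. THIS SKETCH MAKES STRONG SIMPLIFYING ASSUMPTIONS THAT THE REAL METHOD DOES NOT: BOTH PROTEINS ARE ALREADY IN THE SAME ORIENTATION, ONLY THE DIRECTION OF EACH VECTOR IS COMPARED (NOT ITS POSITION OR LENGTH), AND THE SCORE IS JUST THE AVERAGE COSINE BETWEEN PAIRED VECTORS. THE FUNCTION NAMES AND THE EXAMPLE VECTORS ARE MADE UP FOR ILLUSTRATION.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine of the angle between two 3D direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def best_pairing(dirs_one, dirs_two):
    """Score every order-preserving way of pairing the shorter list with the longer."""
    short, long_ = sorted((dirs_one, dirs_two), key=len)
    best = None
    for chosen in combinations(range(len(long_)), len(short)):
        mean_cos = sum(cosine(short[i], long_[j]) for i, j in enumerate(chosen)) / len(short)
        if best is None or mean_cos > best[0]:
            best = (mean_cos, chosen)
    return best  # (score, indices into the longer list)


protein_one = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0)]
protein_two = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (-1, -1, 0)]
print(best_pairing(protein_one, protein_two))
```

IN THIS TOY CASE, PAIRING THE FOUR ELEMENTS OF PROTEIN ONE WITH THE FIRST FOUR OF PROTEIN TWO SCORES HIGHEST, WHILE ANY PAIRING THAT BRINGS IN THE FIFTH ELEMENT SCORES LOWER BECAUSE THAT VECTOR POINTS THE OTHER WAY.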
WHAT YOU END UP GETTING AT THE END OF THE DAY IS SOMETHING LOOKING LIKE THIS.
IN YOUR HANDOUT, SKIP AHEAD ONE SLIDE. THE NEXT TWO HAVE BEEN TRANSPOSED IN YOUR
HANDOUT. THIS IS REMARKABLY GOOD.
THESE ARE TWO PROTEIN STRUCTURES THAT HAVE BEEN DEEMED SIMILAR TO ONE ANOTHER BY THIS
METHOD. KEEP IN MIND, WE THREW AWAY THE ATOMIC COORDINATES.
WE'RE COMPARING THE SERIES OF VECTORS, BUT THESE TWO STRUCTURES PRETTY MUCH OVERLAP EACH
OTHER ALMOST PERFECTLY. NOW I SORT OF PUSHED IT IN THIS CASE, WHERE I
PICKED AN EXAMPLE WHERE THERE'S ONE MUTATION BETWEEN THE TWO.
BUT WHAT YOU SEE TIME AND TIME AGAIN IS JUST AN INCREDIBLY GOOD MATCH BETWEEN THE STRUCTURE THAT YOU
STARTED WITH AND OTHERS THAT ARE FOUND IN PDB USING THIS VAST METHOD.
WE'LL COME BACK TO THIS REPRESENTATION IN A MOMENT.
JUST TO REMIND YOU OF SOME OF THE CAVEATS OF THE METHOD.
BY DEFINITION BECAUSE WE HAVE THROWN AWAY ALL OF THOSE ATOMIC COORDINATES IT IS NOT
THE BEST METHOD FOR DETERMINING STRUCTURAL SIMILARITY BECAUSE WE LOST A LOT OF INFORMATION
ALONG THE WAY. WE HAVE LESS CONFIDENCE IN OUR PREDICTIONS.
REGARDLESS OF THE SIMPLICITY OF THE METHOD IT IS A GREAT FIRST APPROXIMATION AND I WILL
SHOW YOU IN A MINUTE YOU CAN DO THIS YOURSELVES RIGHT AT YOUR DESKTOP.
IF YOU FIND SOMETHING THAT HAS PROMISE USING THIS METHOD, SEEK OUT SOMEBODY WHO IS A STRUCTURAL
EXPERT AND SAY, I WANT TO DELVE IN TO THIS A LITTLE BIT MORE,
AND GET SOME HELP IN THAT DIRECTION. BUT THIS IS SOMETHING THAT YOU ALL CAN DO YOURSELVES.
HOW DO YOU DO THAT? WE'LL GO BACK TO THE NCBI WEBSITE.
AND SO THIS IS THE NCBI HOME PAGE. WE'RE GOING TO USE THE ENTREZ SEARCH ENGINE IN THE UPPER
RIGHT HAND SIDE TO DO OUR SEARCH. IN THE SEARCH PULL DOWN I'VE SELECTED STRUCTURE, AND
I'M PUTTING IN AN ACCESSION NUMBER TO GET BACK ONE AND ONLY ONE ENTRY.
IN THIS CASE THE ACCESSION NUMBER IN PDB TAKES THE FORM OF A NUMBER AND THREE LETTERS.
THE ONE I USED HERE IS 2LIV. WHILE WE'RE HERE LET ME POINT OUT THE STRUCTURE
LINK HERE AS WELL. IF YOU WANT TO GET TO THE CONSERVED DOMAIN
DATABASE THAT I SHOWED YOU, INSTEAD OF TYPING IN THAT LONG URL YOU CAN JUST CLICK ON THAT
AS WELL. SEARCH STRUCTURE FOR 2LIV, CLICK ON THE SEARCH
BUTTON. THAT NOW TAKES US TO OUR RESULTS PAGE.
WE GET BACK ONE AND ONLY ONE RESULT, WHICH WE WOULD EXPECT BECAUSE WE USED AN ACCESSION
NUMBER HERE. HERE IS JUST A LITTLE PICTOGRAM OF THE STRUCTURE
THAT WE FOUND, ITS ACCESSION NUMBER, AND IT TELLS US IT'S A PERIPLASMIC
BINDING PROTEIN THAT IS CALLED A LEUCINE/ISOLEUCINE/VALINE-BINDING PROTEIN.
IF WE WANT TO LEARN A LITTLE BIT MORE ABOUT THIS.
ALL WE HAVE TO DO IS JUST CLICK ON THE ACCESSION NUMBER.
THAT WILL TAKE US TO THIS STRUCTURE SUMMARY PAGE.
AGAIN, THERE'S A PICTOGRAM, A REPRESENTATION OF OUR STRUCTURE, AND THE PRIMARY REFERENCE THAT DESCRIBES THE SOLUTION
OF THIS PARTICULAR STRUCTURE. IN THIS CASE THIS IS AN X-RAY STRUCTURE.
THERE'S A LINK BACK TO THAT REFERENCE IF YOU WANT TO READ THE PAPER. THIS COMES FROM
E. COLI. AND DOWN BELOW WE HAVE A REPRESENTATION OF
WHAT ELSE WE FOUND IN THIS PARTICULAR PROTEIN. OUR SEQUENCE IS SEQUENCE A, THE STRUCTURE
WE STARTED WITH, 2LIV. IT SHOWS US SOME DOMAINS THAT WERE FOUND BELOW
THAT EXIST IN THAT PARTICULAR PROTEIN. WHAT I'M GOING TO DO IS USE THIS GRAPHIC AS
OUR JUMPING OFF POINT TO DO THE SEARCH. WE'LL LET THEM DO THE ALIGNMENT METHOD TO
FIND WHAT OTHER STRUCTURES ARE SIMILAR TO THE ONE THAT I STARTED WITH. ALL YOU HAVE TO DO IS
JUST CLICK ANY PLACE ON THE BAR THAT IS LABELED SEQUENCE A.
YOUR PROTEIN, ONE THAT I START WITH IS RIGHT HERE.
THIS BAR AT THE TOP. YOU'LL NOTICE BELOW A BUNCH OF SOMETIMES CONTINUOUS,
SOMETIMES DISCONTINUOUS BARS, EACH ONE LABELED WITH A PDB IDENTIFIER.
EACH ONE OF THESE REPRESENTS A PROTEIN DEEMED SIMILAR IN STRUCTURE TO THE ONE THAT YOU
STARTED WITH. WHAT THE DISCONTINUITIES REPRESENT ARE PLACES
WHERE YOU DON'T HAVE STRUCTURAL OVERLAP BETWEEN THE ONE YOU STARTED WITH AND THE ONE THAT WAS
FOUND. THIS IS JUST A VISUAL TO CONVEY TO YOU: DO
I HAVE A GLOBAL ALIGNMENT ACROSS THE ENTIRE LENGTH OF MY PROTEIN, AS WE DO IN THE FIRST
CASE WITH THE EXCEPTION OF ONE RESIDUE, AS YOU'LL SEE IN A MOMENT,
OR DO I HAVE SOMETHING WHERE I HAVE DOMAINS IN COMMON?
THAT SHOWS YOU SOME OF THE POWER OF VAST: WE DON'T HAVE TO FORCE A GLOBAL ALIGNMENT.
THESE ARE THE SAME ARGUMENTS FOR WHY WE WANT TO DO LOCAL ALIGNMENTS. AND IN A CURRENT PROTOCOLS REFERENCE I'M GOING
TO GIVE YOU A LITTLE BIT LATER, YOU CAN SEE HOW TO CHANGE HOW THIS LOOKS TO RENDER IT AS A TABLE. THERE
ARE SOME STATISTICAL GUIDELINES IN THAT UNIT THAT I THINK WOULD BE USEFUL TO YOU IN HELPING
TO DECIPHER WHAT THESE TABLES, WHAT THESE LISTS OF RESULTS ACTUALLY MEAN.
IF YOU CLICK ON THAT, THIS WILL LAUNCH A VIEWER CALLED CN3D, WHICH STANDS FOR SEE IN 3D.
I WOULD HAVE LOVED TO BE IN THE PLANNING MEETING WHERE THEY CAME UP WITH THAT NAME.
LET'S LAUNCH CN3D. YOU'VE SEEN THIS. HERE WE CAN CHANGE HOW THE REPRESENTATION
OF THIS PROTEIN IS RENDERED. IN THIS CASE, I'M RENDERING SOMETHING CALLED TUBES, WHICH IS
JUST THE LINES THAT YOU SEE. THE COLORING IN THIS CASE I HAVE SET TO IDENTITY.
IN THIS CASE THERE IS ONLY ONE MISMATCH, AN A TO V, BETWEEN THE TWO SEQUENCES.
THAT IS THE BLUE BIT ALL THE WAY HERE AT THE TOP.
THE REDS ARE THE MATCHES, THE BLUE IS THE MISMATCH.
I COULD TAKE MY CURSOR AND JUST HIGHLIGHT ANY PART OF THIS.
IT WOULD LIGHT UP THE SEQUENCE IN YELLOW, SHOWING WHERE EACH PARTICULAR RESIDUE ACTUALLY LIES.
YOU CAN START TO THINK ABOUT THINGS AS MORE THAN JUST A STRING OF LETTERS, WHERE YOU DON'T HAVE
ANY SENSE OF HOW THIS THING FOLDS. NOW YOU KNOW.
THAT'S VERY IMPORTANT. THERE ARE TWO OTHER VIEWS OF THIS THAT I THINK WOULD
BE IMPORTANT, AND AGAIN THE CURRENT PROTOCOLS UNIT DESCRIBES HOW TO GET THESE.
THE FIRST, FOR EACH ONE OF THESE, I'M SORRY THE ALIGNMENT IS A LITTLE SCHNOOKED HERE,
THERE IS A RENDERING SETTING UNDER THE STYLE MENU.
THE ONE ON THE LEFT IS WHAT I LIKE TO CALL THE SEMINAR VIEW BECAUSE THIS IS A VERY NICE
WAY TO ORIENT PEOPLE IN YOUR AUDIENCE TO WHAT YOUR PROTEIN STRUCTURE IS ALL ABOUT.
SO, SETTINGS FOR THOSE ARE SHOWN BELOW. IN THIS STRUCTURE YOU SEE A BUNCH OF GREEN
CRAYOLA CRAYONS. WHAT THEY ARE IS THE ALPHA HELICES.
THE POINTED END IS THE C-TERMINAL END. THE BROWN BARS ARE THE BETA STRANDS.
AGAIN, THE FLATTENED END IS THE N-TERMINAL END AND THE POINTED END IS THE C-TERMINAL END, SO YOU CAN SEE
HOW THEY ARE ORIENTED THROUGHOUT THE STRUCTURE. NOW, CERTAINLY WE DON'T HAVE PROTEINS IN THE
CELLS FLOATING AROUND AS A BUNCH OF CRAYOLA CRAYONS.
WE HAVE OTHER WAYS TO RENDER THIS TO GET A MORE REALISTIC REPRESENTATION, SO
THE ONE THAT IS ON THE RIGHT IS THE SPACE FILLING REPRESENTATION,
THE BEST APPROXIMATION WE CAN GET USING THIS TOOL OF THE TRUE THREE DIMENSIONAL STRUCTURE
OF THIS PROTEIN, THE SHAPE THAT IS BEING PRESENTED IN THE CELL.
ANY PLACE YOU SEE BLUE IS A POSITIVE CHARGE, ANY PLACE YOU SEE RED IS A NEGATIVE CHARGE.
EVERYTHING ELSE IS NEUTRAL. AND I'VE ALREADY TOLD YOU THAT THIS IS A
BINDING PROTEIN. SO, YOU MIGHT BE ABLE TO LOOK AT THESE CHARGE
DISTRIBUTIONS AND SEE HOW THEY MIGHT POINT YOU TOWARDS A BINDING SITE OR SOME OTHER IMPORTANT
PART OF THE MOLECULE THAT IS ACTUALLY INVOLVED IN ITS FUNCTION.
IT'S INCREDIBLY EASY TO USE. AS EASY AS IT WAS TO EXPLAIN THIS, IF YOU
GO BACK TO YOUR OFFICES AND SIT DOWN TO DO THIS, IT IS THAT EASY TO USE.
I ENCOURAGE YOU TO DO THIS. IT'S GOING TO HELP YOU CONCEPTUALIZE BETTER WHAT YOUR PROTEINS
OF INTEREST, THE ONES THAT YOU YOURSELVES ARE WORKING ON IN YOUR LABORATORY, ARE ALL ABOUT.
YOU NOW HAVE A SENSE OF WHAT'S ON THE SURFACE AND WHAT'S BURIED, WHAT MIGHT BE IN AN ACTIVE
SITE OR IN A BINDING POCKET OR SOME CATALYTIC SITE, SOMETHING ELSE THAT IS IMPORTANT TO WHAT
THIS PARTICULAR ENTITY DOES. TO DETERMINE GAIN OF FUNCTION OR LOSS OF FUNCTION,
YOU CAN HONE DOWN YOUR EXPERIMENTS A LITTLE BIT BETTER.
PLEASE DO TAKE THE TIME TO AVAIL YOURSELVES OF THIS INCREDIBLY USEFUL TOOL. MORE IMPORTANTLY, IT
WILL SERVE YOU VERY WELL IN THOSE INSTANCES WHERE SIMPLE SEQUENCE COMPARISONS JUST WON'T
BE UP TO THE TASK. SOME ADDITIONAL READING.
ONE MORE TIME, I'VE ALLUDED TO THIS SEVERAL TIMES NOW: THIS
IS A UNIT THAT I'VE WRITTEN IN CPBI, UNIT 1.3, THAT TALKS ABOUT CN3D, WHICH WE JUST TALKED ABOUT. YOU'LL HAVE A LITTLE
BIT MORE INFORMATION IN THERE ON HOW TO MAKE THOSE VIEWS AND LABEL THEM, EXPORT THEM, AND GET
THEM IN TO YOUR POWERPOINT PRESENTATION. ALSO, FOR THOSE OF YOU WHO ARE INTERESTED,
THAT UNIT TAKES YOU THROUGH A VERY RIGOROUS OVERVIEW OF HOW TO USE ENTREZ. I KNOW MOST
OF YOU HAVE PROBABLY USED ENTREZ AT SOME POINT. YOU PROBABLY KNOW HOW TO FIND THOSE PAPERS
AND MAYBE FIND THE SEQUENCE, BUT MAYBE DON'T EXACTLY KNOW HOW TO USE ENTREZ TO ITS BEST ADVANTAGE,
ITS FULL POWER. I THINK THAT REALLY YOU SHOULD CONSIDER THAT AS
REQUIRED READING, NOT JUST OPTIONAL. FINALLY, THE SECOND ONE HERE IS AN INTRODUCTION
TO MODELING PROTEIN STRUCTURES FROM SEQUENCE. WE DON'T HAVE TIME TO TALK ABOUT THIS TODAY.
BUT LET'S SAY YOU'VE DONE YOUR VAST SEARCH AND YOU COME UP WITH NOTHING.
THERE IS NO OTHER STRUCTURE SIMILAR TO THE ONE THAT YOU'RE STARTING WITH.
REMEMBER, THOSE COMPARISONS ARE ALL OF SOLVED STRUCTURES, NOT DOING ANY DE NOVO STRUCTURE
PREDICTION. LET'S SAY YOU DO WANT TO DO THAT,
THAT YOU WANT TO START TO SAY, I WANT TO MODEL MY PROTEIN TO SEE THE EFFECT OF A MUTATION, TO
JUST INTERCHANGE ONE RESIDUE FOR ANOTHER TO SEE WHAT WOULD ACTUALLY HAPPEN.
THERE ARE A NUMBER OF MORE ADVANCED METHODS THAT ALLOW YOU TO DO THAT.
THIS GIVES YOU AN OVERVIEW OF THOSE METHODS, WHAT TO USE WHERE AND WHERE TO FIND THEM.
I WOULD ENCOURAGE YOU TO DO THAT AS WELL IF YOUR RESEARCH TAKES YOU IN THAT DIRECTION.
NOW, IN THE LAST 25 MINUTES WE'LL FLY THROUGH MULTIPLE SEQUENCE ALIGNMENT.
IT'S IMPORTANT THAT WE TALK ABOUT THIS. IT USES A LOT OF THE CONCEPTS WE'VE GONE THROUGH IN
THE LAST LECTURE AND A HALF. IT WILL ALSO SET US UP FOR THINGS IN FUTURE LECTURES
WHERE YOU WILL SEE THESE ALIGNMENTS OVER AND OVER AGAIN, WHETHER IT IS IN THE CONTEXT
OF PHYLOGENETIC ANALYSES, OR EVEN STARTING NEXT WEEK WITH THE TALK ABOUT GENOME
BROWSERS. IT'S IMPORTANT TO UNDERSTAND WHERE THEY COME
FROM, I REALIZE THAT MOST OF YOU HAVE PROBABLY DONE THEM BEFORE BUT I WANT TO GIVE YOU SOME
GENERAL GUIDELINES TO MAKE SURE THAT YOU'RE PERFORMING THEM PROPERLY BASICALLY JUST BRING
YOUR GAME UP A NOTCH TO MAKE SURE YOU ARE USING THESE METHODS IN THE MOST ADVANTAGEOUS
WAY. WHY DO WE EVEN BOTHER DOING THESE THINGS?
WHAT DO WE STAND TO GAIN? WHAT CAN WE LEARN BY DOING THESE SEQUENCE
ALIGNMENTS? THEY ALLOW US TO IDENTIFY PATTERNS AND DOMAINS, ALL THOSE
THINGS THAT WE SPENT THE FIRST HALF OF THIS MORNING SPEAKING ABOUT.
AGAIN, AS I'VE SAID MANY TIMES NOW, THOSE THINGS CAN BE TO YOUR ADVANTAGE WHEN YOU THINK
ABOUT QUESTIONS OF EXPERIMENTAL DESIGN AND PREDICTING THE STRUCTURE AND FUNCTION OF A POSSIBLY
UNKNOWN, BRAND NEW PROTEIN THAT YOU DISCOVERED, OR TO IDENTIFY NEW MEMBERS OF A PROTEIN FAMILY.
YOU ABSOLUTELY HAVE TO DO A MULTIPLE SEQUENCE ALIGNMENT IF YOU WANT TO DO A PHYLOGENETIC
ANALYSIS. IT'S IMPOSSIBLE TO DO IT ANY OTHER WAY,
BECAUSE ALL OF THE PHYLOGENETIC METHODS DEPEND ON THE CONSTRUCTION OF A MULTIPLE SEQUENCE
ALIGNMENT AS THEIR INPUT. IT'S NOT THAT YOU CAN JUST STICK IN A BUNCH OF SEQUENCES AND
GET BACK A TREE. YOU HAVE TO ACTUALLY START OFF WITH A MULTIPLE SEQUENCE ALIGNMENT,
THEN THAT ALIGNMENT WILL BE USED TO GENERATE THE TREE.
WE'VE ALREADY TALKED ABOUT THE NEXT POINT, THE GENERATION OF POSITION SPECIFIC
SCORING MATRICES. THIS ALSO MIGHT BOLSTER CONFIDENCE IF YOU
HAVE DONE STRUCTURE PREDICTIONS. WE HAVEN'T TALKED ABOUT PREDICTING STRANDS
IN THESE TWO LECTURES, LET'S SAY YOU HAVE DONE THAT.
AND THE STATISTICAL SUPPORT ISN'T VERY STRONG, IF YOU DO MULTIPLE SEQUENCE ALIGNMENT AND
SEE COMMONALITY IN THOSE SECONDARY STRUCTURAL ELEMENTS ACROSS THE WHOLE HOST OF DIFFERENT
PROTEINS THAT YOU'VE ALIGNED THEN THAT GIVES YOU BETTER CONFIDENCE IN THOSE PREDICTIONS
AS WELL. AGAIN, IT'S THE SAME SPIRIT AS IN THE LABORATORY, WHERE YOU MIGHT
HAVE A RESULT THAT YOU'RE NOT EXACTLY SURE OF, SO YOU USE ANOTHER TECHNIQUE TO VERIFY IT.
THE SAME GAME APPLIES HERE. WHAT DO WE NEED TO CONSIDER WHEN WE DO ONE
OF THESE MULTIPLE SEQUENCE ALIGNMENTS? OF COURSE WE'RE LOOKING FOR ABSOLUTE SEQUENCE SIMILARITY.
WHAT WE WANT TO DO IS IN EACH COLUMN GET AS MANY ABSOLUTELY CONSERVED POSITIONS AS
POSSIBLE, LINE UP AS MANY COMMON CHARACTERS AS WE CAN.
OF COURSE, AS WITH OUR SCORING MATRICES, WE CAN'T ALWAYS HAVE ABSOLUTE CONSERVATION. SOMETIMES WE
HAVE CONSERVATIVE SUBSTITUTIONS, SO WE TAKE THOSE IN TO ACCOUNT AS WELL.
FINALLY YOU MAY BE LUCKY ENOUGH TO HAVE ONE OF THE SEQUENCES IN YOUR SET WHERE THERE IS
A KNOWN STRUCTURE. THAT INFORMATION CAN ALSO BE USED TO FINE
TUNE THE ALIGNMENT, TO GET AT GREATER SUPPORT TO THE ULTIMATE MULTIPLE SEQUENCE ALIGNMENT
THAT YOU GET. SOME GENERAL GUIDELINES.
THINGS TO KEEP IN THE BACK OF YOUR HEAD AS DO YOU THIS YOURSELVES.
WE TEND ONCE AGAIN TO CONCENTRATE ON THE PROTEIN LEVEL RATHER THAN ON THE NUCLEOTIDE LEVEL.
THAT IS JUST BECAUSE IT'S MORE INFORMATIVE. THERE
IS MORE INFORMATION CONTENT IN LOOKING AT THE PROTEIN THAN IN LOOKING AT EACH OF THE NUCLEOTIDES,
BECAUSE OF THE DIFFERENT STRUCTURE OF EACH OF THE 20 AMINO ACID SIDE CHAINS.
IT IS LESS PRONE TO INACCURATE ALIGNMENT. WE'RE TAKING THOSE PHYSICOCHEMICAL PROPERTIES
IN TO ACCOUNT. YOU CAN CERTAINLY TRANSLATE THESE BACK TO
NUCLEOTIDE SEQUENCES AFTER DOING THE ALIGNMENTS. IT DEPENDS ON THE CONTEXT IN WHICH YOU'RE DOING
THIS, BECAUSE YOU MIGHT BE TRYING TO ALIGN NUCLEOTIDE
SEQUENCES WHERE THERE IS NO PROTEIN TRANSLATION. WE'LL SEE EXAMPLES OF THIS IN FOUR WEEKS, TALKING
ABOUT REGULATORY ELEMENTS AND CONSIDERATIONS IN EPIGENETICS.
MORE GUIDELINES. NEED TO USE A REASONABLE NUMBER OF SEQUENCES.
SO THE TEMPTATION IS TO THROW EVERYTHING YOU HAVE AT THE METHOD BUT THAT ACTUALLY WILL
START TO WORK AGAINST YOU IF YOU DO THAT. BECAUSE THIS IS A GLOBAL ALIGNMENT METHOD.
AGAIN WE TALKED ABOUT GLOBAL ALIGNMENTS LAST WEEK.
THE MORE ALIGNMENTS YOU DO, THE LONGER IT TAKES AND THE HARDER IT GETS.
MOST OF THE ALIGNMENT ALGORITHMS START TO FAIL WHEN YOU TRY TO LINE UP TOO MANY SEQUENCES.
THE TRUTH IS, YOU SORT OF REACH A POINT OF DIMINISHING RETURNS, WHERE IF YOU HAVE GONE
FROM 40 SEQUENCES TO 50, ARE YOU REALLY LEARNING ANYTHING MORE BY ADDING THOSE ADDITIONAL TEN SEQUENCES?
IF YOU SELECTED YOUR SEQUENCES WISELY AT THE BEGINNING YOU REALLY DON'T HAVE TO HAVE VERY
HUGE INPUT SETS. ALSO THE PHYLOGENETIC STUDIES THAT MIGHT ARISE
FROM THIS ARE ALMOST IMPOSSIBLE TO DO. I REMEMBER ONCE WHERE I TRIED TO DO AN ALIGNMENT,
PHYLOGENETIC TREE ON A SET THAT HAD SOMETHING LIKE 130 SEQUENCES ON IT.
IT TOOK A MONTH FOR THE COMPUTER TO FINALLY GET AROUND TO GIVING ME AN ANSWER.
IT BECOMES VERY, VERY COMPUTATIONALLY UNREASONABLE. TEN TO 15 SEQUENCES, THAT SEEMS TO BE WHAT
FOLKS LIKE TO USE IN THE LITERATURE. YOUR BALLPARK UPPER LIMIT IS AROUND 50 BEFORE
YOU START TO SEE SOME OF THE PROBLEMS THAT I'VE MENTIONED TO YOU.
THAT TAKES CARE OF THE NUMBER OF SEQUENCES. WHAT ABOUT THE NATURE OF THE SEQUENCES?
BECAUSE THIS IS GLOBAL SEQUENCE ALIGNMENT, IT AGAIN WORKS BEST WHEN YOU HAVE SEQUENCES OF ABOUT
THE SAME LENGTH. YOU WANT TO USE CLOSELY RELATED SEQUENCES;
THOSE WILL TELL YOU WHAT IS, QUOTE, REQUIRED, WHAT RESIDUES ARE ABSOLUTELY CONSERVED.
IF YOU USE MORE DIVERGENT SEQUENCES, YOU CAN USE THOSE TO STUDY EVOLUTIONARY RELATIONSHIPS IN THAT
GROUP OF SEQUENCES. YOU WANT SOME OF BOTH.
USUALLY A GOOD STARTING POINT IS SEQUENCES THAT TEND TO BE 30 TO 70% SIMILAR.
YOU HAVE A FAIRLY LARGE BERTH TO WORK IN HERE. THE LAST POINT IS REALLY THE MOST IMPORTANT ONE,
THAT THE MOST INFORMATIVE ALIGNMENTS REALLY COME WHEN YOU HAVE A COMBINATION OF THESE,
THINGS THAT ARE NOT TOO SIMILAR. IF THEY'RE ALL TOO SIMILAR YOU'RE NOT GOING
TO LEARN ANYTHING. YOU ALREADY KNOW FROM THE SINGLE SEQUENCE WHAT YOU
CAN FIND OUT IF THEY'RE ALMOST ALL EXACTLY THE SAME.
BUT IF THEY'RE TOO DIFFERENT, YOU THEN END UP IN A SITUATION WHERE YOU JUST COMPUTATIONALLY
CAN'T ALIGN THEM. AGAIN, TEN TO 15 AS YOUR STARTING POINT.
THIS SHOULD ALSO BE AN ITERATIVE PROCESS WHERE YOU START WITH THAT TEN TO 15.
DO THE ALIGNMENT, SEE HOW IT LOOKS. EXAMINE THE QUALITY OF THE ALIGNMENT.
WHAT I MEAN BY THAT IS, HOW MANY GAPS ARE THERE? EACH ONE OF THOSE REPRESENTS A BIOLOGICAL
EVENT, EITHER AN INSERTION OR A DELETION. YOU HAVE TO KEEP
THOSE TO A REASONABLE NUMBER. YOU DON'T WANT TO WILLY-NILLY PUT THEM IN.
IF THE ALIGNMENT LOOKS GOOD, ADD SOME MORE, DO THE ALIGNMENT AGAIN, AND JUST KEEP GOING IN
THAT FASHION. IF YOU SEE THAT THE ALIGNMENT IS STARTING
TO BREAK DOWN, SO, FOR EXAMPLE, YOU MIGHT HAVE JUST INSERTED A SEQUENCE THAT NOW IS
PUTTING AN INORDINATE NUMBER OF GAPS IN, TAKE THAT OUT. THERE'S AN ELEMENT OF FINE TUNING.
YOU WILL LEARN OVER TIME IT'S ACTUALLY RATHER INTUITIVE.
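ONE ROUGH WAY TO PUT A NUMBER ON "TOO MANY GAPS" IS TO COUNT THE GAP CHARACTERS IN EACH ROW OF THE ALIGNMENT. THE SKETCH BELOW ASSUMES THE ALIGNMENT IS HELD AS A DICTIONARY OF NAME TO ALIGNED STRING WITH "-" FOR GAPS. THE 0.3 CUTOFF IS AN ARBITRARY ILLUSTRATION, NOT A RECOMMENDATION FROM THE LECTURE, AND THE FUNCTION NAMES AND TOY ROWS ARE MADE UP.

```python
def gap_fractions(alignment):
    """Fraction of each aligned row that is made up of gap characters."""
    return {name: row.count("-") / len(row) for name, row in alignment.items()}


def gap_heavy(alignment, cutoff=0.3):
    """Names of rows whose gap fraction exceeds the (hypothetical) cutoff."""
    return [name for name, frac in gap_fractions(alignment).items() if frac > cutoff]


toy_alignment = {
    "seq_a": "ACDE-",
    "seq_b": "ACDEF",
    "seq_c": "A---F",  # a row that needed many gaps to fit
}
print(gap_fractions(toy_alignment))
print(gap_heavy(toy_alignment))
```

A ROW FLAGGED THIS WAY IS ONLY A CANDIDATE FOR REMOVAL. WHETHER IT BELONGS IN THE SET IS STILL A BIOLOGICAL JUDGMENT.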
NOW, THAT'S HOW TO MAKE THEM. HOW DO YOU INTERPRET THEM?
WHEN YOU SEE A PARTICULAR COLUMN IN YOUR MULTIPLE SEQUENCE ALIGNMENT WHERE YOU HAVE ABSOLUTELY
CONSERVED POSITIONS, THAT INDICATES THAT THOSE ARE REQUIRED FOR PROPER STRUCTURE AND FUNCTION.
THEY HAVE BEEN CONSERVED FOR A REASON. WHEN YOU SEE RELATIVELY WELL CONSERVED POSITIONS
THOSE ARE THE ONES WHERE YOU CAN SAY, ALL RIGHT, I CAN TOLERATE A CERTAIN AMOUNT OF
CHANGE AND NOT ADVERSELY AFFECT THE STRUCTURE OR FUNCTION OF THE PROTEIN.
MOST PEOPLE TEND TO CONCENTRATE ON THESE LOOKING AT THE COMMONALITY.
BUT I THINK IT'S ACTUALLY QUITE INTERESTING SOMETIMES TO LOOK AT THE DIFFERENCES, THE
POSITIONS THAT ARE NOT CONSERVED BECAUSE THOSE ARE ALLOWED TO MUTATE FREELY.
THIS IS SORT OF A SOURCE OF EVOLUTIONARY INNOVATION, WHERE MOTHER NATURE CAN ACTUALLY COME
UP WITH CHANGES IN THOSE PROTEINS THAT CAN BE TOLERATED BECAUSE THE ORIGINAL FUNCTION
IS STILL SUPPORTED, BUT MAYBE START TO DEVELOP NEW PROTEINS THAT
HAVE SLIGHTLY DIFFERENT FUNCTIONS IN THE CELL. IF YOU SEE GAP FREE BLOCKS, THOSE ARE USUALLY
REGIONS OF SECONDARY STRUCTURE. YOU JUST WON'T HAVE A GAP IN ONE OF THOSE.
IF YOU SEE GAP RICH BLOCKS, THOSE ARE USUALLY UNSTRUCTURED REGIONS OR OTHER REGIONS.
THAT IS WHAT IT IS. ENOUGH OF THAT.
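THE COLUMN-BY-COLUMN READING DESCRIBED ABOVE CAN BE SKETCHED AS A SMALL CLASSIFIER. THIS IS AN ILLUSTRATION UNDER SIMPLE ASSUMPTIONS: "RELATIVELY CONSERVED" IS APPROXIMATED BY A MAJORITY RULE (AT LEAST HALF THE ROWS SHARE ONE RESIDUE) RATHER THAN BY A SUBSTITUTION MATRIX, AND THE 0.5 CUTOFF, THE LABELS, AND THE TOY ALIGNMENT ARE ALL MADE UP.

```python
def classify_column(column, majority=0.5):
    """Label one alignment column by how well its residues agree."""
    if "-" in column:
        return "gapped"
    distinct = set(column)
    if len(distinct) == 1:
        return "absolutely conserved"
    most_common = max(column.count(residue) for residue in distinct)
    if most_common / len(column) >= majority:
        return "relatively conserved"
    return "variable"


def classify_alignment(rows):
    """Apply the column label to every position of equal-length aligned rows."""
    return [classify_column(column) for column in zip(*rows)]


toy_rows = ["MKLV", "MKIA", "MRLT", "M-FS"]
for position, label in enumerate(classify_alignment(toy_rows), start=1):
    print(position, label)
```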
HOW DO WE DO THIS? THE METHOD I'M GOING TO DESCRIBE IS SOMETHING CALLED
-- SO THIS ALSO COMES FROM THE FOLKS AT THE SANGER INSTITUTE, AND THIS JUST VERY SIMPLY
ALLOWS YOU TO TAKE A SEQUENCE SET OF INTEREST AND DO YOUR MULTIPLE SEQUENCE ALIGNMENT.
THERE'S A STAND ALONE VERSION THAT YOU CAN DOWNLOAD.
WEB-BASED VERSION IS VERY NICE AND THAT'S ONE I'M GOING TO SHOW YOU THEN YOU DON'T ALSO
HAVE TO WORRY, DO I HAVE THE LATEST VERSION. HOW DOES THIS ACTUALLY WORK.
AGAIN, TO GET US AWAY FROM THE BLACK BOX, A LITTLE BIT OF BACKGROUND.
SO, WHAT THIS IS USING IS A METHOD CALLED PROGRESSIVE ALIGNMENT METHOD.
REGARDLESS OF THE NUMBER OF SEQUENCES THAT I START WITH, IT'S GOING TO ONLY ALIGN TWO
SEQUENCES AT A TIME. IT'S NOT GOING TO ATTEMPT TO ALIGN THEM ALL AT
THE SAME TIME. AND BASED ON THESE PAIRWISE ALIGNMENTS IT BUILDS UP THE MULTIPLE SEQUENCE
ALIGNMENT, CLUSTERING THEM ON THE BASIS OF SIMILARITY.
IT'S GOING TO USE THE SAME KIND OF MATRICES THAT WE TALKED ABOUT LAST WEEK.
THE SAME KIND OF GAP PENALTIES TO CALCULATE THE ALIGNMENT THAT HAS THE BEST SCORE.
BY DOING IT THIS WAY, TWO MAJOR ADVANTAGES. ONE IT'S FAST.
AND THE ALIGNMENTS ARE GENERALLY VERY HIGH QUALITY.
SO THAT'S WHY I LIKE TO USE THIS KIND OF METHOD WITH SOME CAVEATS.
WHAT DOES THIS ACTUALLY MEAN WHEN WE SAY PROGRESSIVE ALIGNMENT.
HERE IS A SEQUENCE SET THAT I PUT TOGETHER, FOUR SEQUENCES, A, B, C AND D.
AND IN THE FIRST STEP WHAT I WANT TO DO IS JUST CALCULATE HOW IDENTICAL EACH ONE OF THESE
IS TO ALL OF THE OTHERS. A TO B, B TO C, C TO D.
SO ON. AND AGAIN WITH THE W APOLOGIES FOR THE ALIGNMENTS
HERE. IN ORDER TO DO THAT, IT WILL DO ALL OF THESE
ALIGNMENTS BECAUSE OF THIS EQUATION HERE THAT DICTATES HOW MANY ALIGNMENTS HAVE TO BE DONE
BASED ON THE NUMBER OF SEQUENCES. IF WE HAVE FOUR SEQUENCES, IT RESULTS IN SIX
ALIGNMENTS. BUT YOU CAN SEE THE NUMBERS GET PRETTY BIG AS
WE GO DOWN. IF YOU'VE GOT 100 SEQUENCES YOU ARE AT ABOUT 5,000 (4,950, TO BE EXACT)
ALIGNMENTS. THIS SORT OF DRIVES HOME WHAT I WAS SAYING EARLIER ABOUT THE SET BEING A LITTLE
TOO BIG. HERE ARE FOUR SEQUENCES AGAIN.
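AS A QUICK CHECK ON THOSE NUMBERS, HERE IS A MINIMAL PYTHON SKETCH. IT ASSUMES THE EQUATION ON THE SLIDE IS THE STANDARD COUNT OF UNIQUE PAIRS, N(N-1)/2, WHICH MATCHES THE SIX ALIGNMENTS FOR FOUR SEQUENCES. THE FUNCTION NAME IS MADE UP FOR ILLUSTRATION.

```python
def pairwise_alignment_count(n_sequences):
    """Number of unique sequence pairs: n * (n - 1) / 2 (assumed slide formula)."""
    return n_sequences * (n_sequences - 1) // 2


# Show how quickly the number of pairwise alignments grows.
for n in (4, 10, 100):
    print(n, "sequences ->", pairwise_alignment_count(n), "pairwise alignments")
# 4 sequences -> 6, 10 sequences -> 45, 100 sequences -> 4950
```

FOR 100 SEQUENCES THIS GIVES 4,950, WHICH IS THE ROUGHLY 5,000 QUOTED ABOVE.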
I HAVE COMPUTED MY SCORING MATRIX. A, B, C AND D GOING ACROSS THE TOP.
A, B, C AND D GOING DOWN THE SIDE. ACROSS THE DIAGONAL IS THE COMPARISON TO THEMSELVES;
OF COURSE THEY ARE 100% IDENTICAL TO THEMSELVES. BUT NOW I'M JUST GOING TO LOOK FOR THE LARGEST
NUMBER IN THE TABLE TO SEE WHICH ONES ARE MOST RELATED TO THE OTHER.
WHAT I SEE HERE IS, A IS MOST RELATED TO B. 80%.
C IS MOST RELATED TO D. 92%.
I'M GOING TO TREAT A AND B TOGETHER. AND C AND D TOGETHER.
A AND B SHARE GREATER SIMILARITY WITH EACH OTHER THAN WITH C OR D.
WHAT I'M GOING TO DO, I'LL TAKE A AND B AND ALIGN THEM WITH EACH OTHER, CREATE ALIGNMENT
CALLED AB AND GOING TO FIX THAT. SAME THING WITH C AND D.
ALIGN THOSE TWO, FIX THE ALIGNMENT. NOW, IN THE NEXT STEP I'M GOING TO TAKE THAT
FIXED ALIGNMENT OF A AND B, ALIGN THAT WITH THE FIXED ALIGNMENT OF C AND D.
NOW USING BLOCKS OF ALIGNMENT AND ALIGNING THE ALIGNMENTS.
JUST DO THAT AS MANY TIMES AS WE HAVE TO UNTIL WE GET ALL OF THE SEQUENCES IN TO THE ALIGNMENT.
WE START WITH INDIVIDUAL SEQUENCES. WE BUILD THESE LITTLE SUBALIGNMENTS, ALIGN
THE SUBALIGNMENTS WITH EACH OTHER. WHAT THAT ALLOWS YOU TO DO, STARTING WHERE THERE
IS GREATEST AMOUNT OF IDENTITY WE DO EASY ONES FIRST.
WE USE THAT INFORMATION THAT WE BUILD ALONG THE WAY TO INFORM HOW TO DO THE HARDER ALIGNMENT.
THAT'S NOT THAT DIFFERENT FROM THE PHILOSOPHY THAT WE USE IN PSI-BLAST.
WE FIND THE THINGS THAT ARE COMMON THEN USING THE COLLECTIVE CHARACTERISTICS TO BUILD OUT.
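THE "EASY ONES FIRST" ORDERING CAN BE SKETCHED IN A FEW LINES OF PYTHON. THIS IS A SIMPLIFIED AVERAGE-LINKAGE CLUSTERING, NOT NECESSARILY THE EXACT PROCEDURE THE PROGRAM USES TO BUILD ITS GUIDE TREE. THE 80% (A-B) AND 92% (C-D) VALUES COME FROM THE EXAMPLE ABOVE; THE OTHER FOUR IDENTITIES AND THE FUNCTION NAME ARE HYPOTHETICAL, CHOSEN ONLY TO FILL IN THE TABLE.

```python
def merge_order(names, identity):
    """Greedy sketch: repeatedly merge the two clusters with the highest
    average percent identity, and record the order of the merges."""
    clusters = [(name,) for name in names]

    def average_identity(c1, c2):
        values = [identity[frozenset((a, b))] for a in c1 for b in c2]
        return sum(values) / len(values)

    merges = []
    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        score, i, j = max(
            (average_identity(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((clusters[i], clusters[j], score))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# A-B and C-D are from the lecture example; the rest are made-up values.
identity = {
    frozenset("AB"): 80.0, frozenset("CD"): 92.0,
    frozenset("AC"): 45.0, frozenset("AD"): 47.0,
    frozenset("BC"): 44.0, frozenset("BD"): 46.0,
}

for left, right, score in merge_order("ABCD", identity):
    print("align", "".join(left), "with", "".join(right), "(identity %.1f)" % score)
```

WITH THESE NUMBERS THE TWO CLOSE PAIRS ARE ALIGNED FIRST AND THE TWO RESULTING SUBALIGNMENTS ARE ALIGNED WITH EACH OTHER LAST.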
THE PROBLEM WITH THAT, THOUGH, IS IF THERE IS AN ERROR IN THE INITIAL ALIGNMENT AND YOU FIX
IT, YOU'RE GOING TO PROPAGATE THAT ERROR THROUGHOUT THE ALIGNMENT.
WHAT IS NICE IN THIS NEW VERSION OF -- WHY I LIKE THIS OVER SOME OF THE OTHER PROGRESSIVE
ALIGNMENT METHODS -- IS THAT IT ALLOWS SOMETHING CALLED A REMOVE FIRST STEP WHERE YOU
CAN BACKTRACK AND TAKE THINGS OUT. IT WILL RECALCULATE TO IMPROVE THE ALIGNMENT.
I'LL SHOW YOU HOW TO THROW THOSE FLAGS WHEN WE GET TO THE SCREEN.
WHAT DO WE GET? ONCE WE DO THAT, WE GET THE SCORES THAT ARE
USED TO INFORM HOW TO DO THIS BUILD UP OF THE ALIGNMENT.
LUCKILY WE GET A MULTIPLE SEQUENCE ALIGNMENT OUT; THAT'S THE WHOLE POINT.
WE ALSO GET TWO TREE-BASED REPRESENTATIONS. ONE IS CALLED A CLADOGRAM; THIS IS JUST THAT
BUILT UP TREE. WHAT SEQUENCES WERE ALIGNED TOGETHER -- IT
GIVES YOU SOME IDEA OF COMMON ANCESTRY BUT YOU DON'T HAVE BRANCH LENGTHS THAT GIVE YOU
AN INDICATION OF EVOLUTIONARY TIME. THE PHYLOGRAM IS BASICALLY THE SAME IDEA BUT
NOW THE BRANCH LENGTHS DO VARY TO GIVE YOU AT LEAST A VISUAL IDEA PROPORTIONALLY
OF HOW MUCH EVOLUTIONARY CHANGE HAS TAKEN PLACE OVER TIME.
LET ME SHOW YOU WHERE I GET THAT CONSERVATION PATTERN. SO WHEN YOU SEE THE COLORS ON THE VARIOUS
ALIGNMENT, THIS IS THE SCHEME THAT IT USES TO DETERMINE WHAT IS A CONSERVATIVE SUBSTITUTION,
SO THE AROMATICS ARE ALL TOGETHER, POSITIVELY CHARGED RESIDUES AND NEGATIVELY CHARGED RESIDUES, AND OTHER
CLASSES THAT ARE COMMON TO THE ENDS OF THE HELICES, AND CYSTEINE IS A SPECIAL CLASS OF ITS OWN BECAUSE
OF ITS ROLE IN MAINTAINING THOSE CYSTEINE (DISULFIDE) CROSS BRIDGES THAT ARE VERY IMPORTANT TO STRUCTURE
AND FUNCTION. THE INTERPRETATION HERE IS EMPIRICAL.
I DON'T HAVE NUMBERS TO POINT YOU TO AS I HAVE WITH THE OTHER METHODS TO GIVE YOU GENERAL
STARTING POINTS FOR CUTOFFS AND SIMILAR CONSIDERATIONS. SO THE INTERPRETATION IS STRICTLY EMPIRICAL.
WHAT YOU WILL GET BACK FOR EACH COLUMN IN YOUR ALIGNMENT: AT THE BOTTOM OF EACH YOU'LL
SEE ONE OF THESE MARKS OR JUST A SPACE. IF YOU SEE A STAR THAT IS ENTIRELY CONSERVED. YOU
WANT TO SEE THOSE STARS AT 10% OR MORE OF THE POSITIONS ACROSS YOUR ALIGNMENT TO CONSIDER
THAT TO BE A GOOD ALIGNMENT. THAT IS A GENERALLY ACCEPTED INDICATION OF A
GOOD ALIGNMENT. CONSERVATION IS DICTATED BY THOSE GROUPS THAT
I JUST SHOWED YOU, SO IF YOU HAVE ONLY RESIDUES IN THOSE GROUPS AT A PARTICULAR POSITION YOU'LL
SEE THE COLON HERE ACCORDING TO THAT COLOR TABLE.
THIS ONE IS A LITTLE INTERESTING. IF YOU JUST SEE A DOT THIS IS WHAT THEY HAVE
COME UP WITH CALLING SEMI-CONSERVED. THAT JUST MEANS YOU HAVE RESIDUES FROM TWO
OF THOSE CONSERVED CLASSES BUT NO MORE THAN TWO OF THOSE.
REALLY, THE ONES YOU SHOULD FOCUS IN ON ARE THE FIRST TWO HERE.
THIS IS JUST COLORING TABLE. YOU CAN LOOK AT THAT ON YOUR OWN.
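THE STAR RULE OF THUMB IS EASY TO CHECK BY HAND OR WITH A SMALL SCRIPT. THE SKETCH BELOW ONLY HANDLES THE STAR (FULLY CONSERVED) COLUMNS, NOT THE COLON OR DOT MARKS, AND IT ASSUMES A GAP NEVER COUNTS AS CONSERVED. THE THREE SHORT SEQUENCES AND THE FUNCTION NAMES ARE HYPOTHETICAL.

```python
def star_line(alignment):
    """Return '*' under every column where all sequences share one residue."""
    marks = []
    for column in zip(*alignment):
        fully_conserved = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if fully_conserved else " ")
    return "".join(marks)


def passes_star_rule(alignment, minimum_fraction=0.10):
    """Rule of thumb from the lecture: stars at 10% or more of positions."""
    marks = star_line(alignment)
    return marks.count("*") / len(marks) >= minimum_fraction


# Toy alignment (made-up sequences, already aligned, '-' is a gap).
alignment = [
    "MKTAYIAKQR",
    "MKSAYLAKQR",
    "MRTA-IAKHR",
]

for row in alignment:
    print(row)
print(star_line(alignment))
print("passes 10% rule:", passes_star_rule(alignment))
```

HERE FIVE OF THE TEN COLUMNS GET A STAR, SO THIS TOY ALIGNMENT CLEARS THE 10% THRESHOLD EASILY.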
HERE ARE THE SCREENS THAT I WANT TO FOCUS ON. STARTING OFF, THE FIRST THING I WANT TO SHOW YOU IS
WHERE IT SAYS, MATRIX, RIGHT NOW IT SAYS DEFAULT. THE DEFAULT.
BUT WE HAVE, AS BEFORE, CHOICES HERE. WE SPENT MOST OF OUR TIME LAST WEEK TALKING
ABOUT THE BLOSUM MATRICES, I MENTIONED THE PAM MATRICES.
YOU'LL NOTICE THAT THERE'S NO NUMBER HERE. REMEMBER LAST WEEK, BLOSUM 62, BLOSUM 80
AND SO ON, WHAT THE METHOD WILL DO IS PICK THE APPROPRIATE MATRIX DEPENDING ON WHICH
SEQUENCES IT'S TRYING TO ALIGN. IT WILL CHANGE THEM AS IT GOES THROUGH YOUR
SEQUENCE SET. I USUALLY PICK BLOSUM THERE.
THE OTHER THING I WANT TO SHOW YOU HERE IS WHAT'S IN THE RED BOX, WHICH IS HOW YOU CONTROL THAT
REMOVE FIRST STEP, SO YOU DON'T END UP IN A LOCAL MINIMUM WHERE
YOU MADE A WRONG ALIGNMENT THAT PROPAGATES THROUGH YOUR TREE.
YOU CAN USE THESE SETTINGS TO DICTATE WHEN THAT PROCEDURE TAKES PLACE.
IF YOU PICK TREE UNDER ITERATION IT WILL HAPPEN AT EACH AND EVERY STEP. THAT IS MORE COMPUTATIONALLY
INTENSIVE BUT PROBABLY IS THE SAFER BET. IF YOU LEAVE IT ON ALIGNMENT IT WILL ONLY DO IT IN THE
FINAL STEPS. FOR THE NUMBER OF ITERATIONS THE DEFAULT IS THREE.
AGAIN I JUST BUMP IT ALL THE WAY TO TEN. DOWN BELOW I'VE JUST PASTED FIVE SEQUENCES
IN HERE. THESE ARE ALL PROTEINS AND THEY ARE ALSO ON THAT WEBPAGE
THAT YOU HAVE FOR YOU TO DO YOUR PRACTICE FROM.
AT THIS POINT I WOULD CLICK "RUN" DOWN AT THE BOTTOM HERE.
AND THIS IS WHAT I GET. AT THE TOP IS JUST A SERIES OF LINKS THAT WILL
ALLOW ME TO JUMP AROUND MY PAGE. RIGHT BELOW I JUST HAVE ALL OF MY SCORES THAT
WERE USED TO INFORM HOW THIS PARTICULAR ALIGNMENT WAS CONSTRUCTED.
IF I SCROLL DOWN, HERE IS MY ALIGNMENT AFTER ALL OF THAT. YOU HAVE THE ALIGNMENT COLORED ACCORDING
TO THE COLOR TABLE IN YOUR HANDOUT. I'LL SHOW YOU HOW TO CHANGE THIS MOMENTARILY.
GOING FURTHER DOWN THE PAGE, HERE IS THE CLADOGRAM. YOU CAN GET A SENSE OF PHYLOGENETIC
RELATIONSHIPS. AGAIN, THE CLADOGRAM DOESN'T GIVE YOU ANY
SENSE OF EVOLUTIONARY DISTANCE. IF YOU CLICK ON PHYLOGRAM IT WILL RECAST TO
THIS FORMAT WHERE YOU HAVE A BETTER SENSE OF EVOLUTIONARY DISTANCE.
THIS IS NOT A PHYLOGENETIC TREE. I WILL SHOW YOU HOW TO MAKE ONE.
THAT IS NOT IT. REMEMBER THAT.
HOW ARE WE GOING TO GET THERE? WE'LL USE A TOOL CALLED -- THIS IS A JAVA APPLET
THAT CAN BE USED TO MANUALLY EDIT YOUR ALIGNMENT. LET'S SAY THE METHOD HAS CREATED THE ALIGNMENT
BUT YOU WANT TO FIDDLE WITH IT AND MOVE THINGS A LITTLE BIT OVER. YOU MIGHT HAVE REASON TO SAY,
WELL THIS RESIDUE SHOULD REALLY BE ALIGNED IN THIS COLUMN RATHER THAN THIS.
SO YOU CAN ACTUALLY HAVE SOME FINE CONTROL, YOU CAN CHANGE THE COLORS, YOU CAN DO CONSENSUS
SEQUENCE CALCULATIONS AND, MORE IMPORTANTLY, SECOND FROM THE BOTTOM, THAT IS WHERE WE'LL
MAKE OUR PHYLOGENETIC TREE. TO GET TO THE APPLET THERE IS A BUTTON AT THE
TOP OF YOUR RESULTS PANE THAT SAYS START. IF YOU CLICK ON THAT, YOU GET A NEW WINDOW POPPING
UP THAT LOOKS LIKE THIS. ON THE NEXT SEVERAL SLIDES UP AT THE TOP HERE
I'M GOING TO GIVE YOU A PATH. THE MENUS ARE IN THESE TEENY, TINY, BARELY
VISIBLE, LIKE ONE-POINT TYPE HERE AT THE TOP. BUT I JUST WANT YOU TO KNOW WHERE THOSE ARE.
HERE IS YOUR DEFAULT VIEW. OUR FIVE SEQUENCES GOING FROM ONE TO THE END.
THREE HISTOGRAMS TO INDICATE THE QUALITY OF YOUR ALIGNMENT.
THE FIRST LINE SAYS CONSERVATION. THIS IS JUST AN INDICATION OF PERCENT IDENTITY,
HOW IDENTICAL IS THAT POSITION. THAT GOES HAND IN HAND WITH THE ALIGNMENT QUALITY, SO
THESE USUALLY PARALLEL EACH OTHER. AND FINALLY AT THE BOTTOM YOU SEE THE CONSENSUS SEQUENCE.
IN SOME POSITIONS YOU'LL SEE A PLUS SIGN; THOSE ARE JUST POSITIONS WHERE NO CONSENSUS SEQUENCE
COULD BE RELIABLY DETERMINED. NOW, THAT IS MY DEFAULT VIEW.
LET'S PLAY WITH THIS. FIRST ONE, IF I GO TO THESE MENUS, GO TO THE
COLOR MENU THEN PICK "PERCENT IDENTITY" THE COLOR SCHEME CHANGES TO THE SHADES OF BLUE.
HERE IS WHAT THE SHADES OF BLUE CORRESPOND TO.
WHY THIS IS USEFUL TO YOU IS THIS VERY QUICKLY ALLOWS YOU TO FIND MOTIFS IN YOUR OWN ALIGNMENT
THAT ARE PUTATIVELY IMPORTANT. SO, WHEN YOU DO THIS, YOU SHOULD LOOK FOR
BLOCKS OF HIGH OR ABSOLUTE SEQUENCE IDENTITY. WHAT THIS IS GOING TO TELL YOU IS WHAT PARTS
OF THESE SEQUENCES HAVE HAD ON THEM SOME SORT OF EVOLUTIONARY PRESSURE TO NOT CHANGE.
TO KEEP THOSE RESIDUES CONSERVED. LET'S CHANGE IT ONE MORE TIME.
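THE SAME IDEA, LOOKING FOR BLOCKS OF HIGH OR ABSOLUTE IDENTITY, CAN BE SKETCHED IN PYTHON. THIS IS NOT HOW THE VIEWER ITSELF COMPUTES ITS SHADING; IT IS A SIMPLE ASSUMPTION WHERE PERCENT IDENTITY FOR A COLUMN IS THE SHARE OF SEQUENCES CARRYING THE MOST COMMON RESIDUE, WITH GAPS COUNTED AS MISMATCHES. THE SEQUENCES, FUNCTION NAMES AND THRESHOLDS ARE HYPOTHETICAL.

```python
from collections import Counter


def column_percent_identity(alignment):
    """Percent of sequences sharing the most common residue in each column."""
    percents = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            percents.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        percents.append(100.0 * top_count / len(column))
    return percents


def conserved_blocks(percents, threshold=100.0, min_length=2):
    """Return (start, end) column ranges, end exclusive, that stay at or
    above the threshold for at least min_length consecutive columns."""
    blocks = []
    start = None
    for i, value in enumerate(percents + [-1.0]):  # sentinel closes last block
        if value >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_length:
                blocks.append((start, i))
            start = None
    return blocks


# Toy alignment (made-up sequences, already aligned, '-' is a gap).
alignment = [
    "MKTAYIAKQR",
    "MKSAYLAKQR",
    "MRTA-IAKHR",
]

percents = column_percent_identity(alignment)
print([round(p) for p in percents])
print(conserved_blocks(percents))
```

LOWERING THE THRESHOLD WIDENS THE BLOCKS, WHICH IS ROUGHLY WHAT MOVING FROM THE DARKEST SHADE TO THE LIGHTER SHADES DOES VISUALLY.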
NOW WE GO TO THE CALCULATE MENU IN THIS TEENY, TINY MENU THEN JUST ASK FOR AN ALIGNMENT. BEFORE
DOING THAT, I HIGHLIGHTED TWO OF THE SEQUENCES HERE.
I CLICK ON ALIGNMENT AND THERE IS MY PAIRWISE ALIGNMENT OF THOSE TWO TO EACH OTHER, AND IT ALSO GIVES ME A
SENSE OF PERCENT IDENTITY AS WELL. LET'S DO IT AGAIN.
THIS TIME, I SELECT ALL OF MY SEQUENCES, I GO TO THE CALCULATE MENU.
I ASK IT TO CALCULATE A TREE. THERE ARE FOUR CHOICES THERE, THE ONE THAT
I PICK IS SOMETHING CALLED NEIGHBOR JOINING USING BLOSUM 62.
HERE NOW IS OUR FIRST PHYLOGENETIC TREE. THIS SHOWS YOU THE RELATIONSHIP BETWEEN THE
FIVE SPECIES HERE. THE FIVE SEQUENCES, MOUSE BEING MOST RELATED
TO HUMAN, RAT BEING MOST RELATED TO MOUSE, AND SO ON.
YOU CAN OVERLAY EVOLUTIONARY DISTANCE INFORMATION, BOOTSTRAP VALUES AND SO ON ONTO THESE TO GET A
SENSE OF THE QUALITY OF YOUR PHYLOGENETIC TREE.
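TO TAKE SOME OF THE MYSTERY OUT OF NEIGHBOR JOINING, HERE IS A COMPACT PYTHON SKETCH OF THE TEXTBOOK ALGORITHM. IT IS NOT THE APPLET'S CODE, AND IT STARTS FROM A READY-MADE DISTANCE MATRIX RATHER THAN DERIVING DISTANCES FROM BLOSUM 62 SCORES. THE FIVE LABELS AND THE DISTANCES ARE MADE-UP TOY VALUES, AND THE FUNCTION NAME IS HYPOTHETICAL.

```python
def neighbor_joining(names, matrix):
    """Textbook neighbor joining on a symmetric distance matrix.
    Returns (joins, final_edge): each join is (left, right, left_branch,
    right_branch); final_edge connects the last two remaining nodes."""
    nodes = list(names)
    d = {a: {b: float(matrix[i][j]) for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        # Total distance from each node to all the others.
        total = {a: sum(d[a][b] for b in nodes) for a in nodes}
        # Pick the pair with the smallest Q value.
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - total[a] - total[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths from the new internal node to a and to b.
        branch_a = 0.5 * d[a][b] + (total[a] - total[b]) / (2 * (n - 2))
        branch_b = d[a][b] - branch_a
        new = "(" + a + "," + b + ")"
        # Distances from the new node to every remaining node.
        d[new] = {new: 0.0}
        for c in nodes:
            if c not in (a, b):
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
        joins.append((a, b, branch_a, branch_b))
    left, right = nodes
    return joins, (left, right, d[left][right])


# Toy distances between five hypothetical sequences.
names = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

joins, final_edge = neighbor_joining(names, matrix)
for left, right, bl, br in joins:
    print("join", left, "and", right, "with branch lengths", bl, "and", br)
print("final edge:", final_edge)
```

UNLIKE THE CLADOGRAM, THE BRANCH LENGTHS THAT COME OUT OF THIS CARRY DISTANCE INFORMATION, WHICH IS WHAT MAKES THE RESULT A TREE YOU CAN INTERPRET PHYLOGENETICALLY.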
I LIKE THIS A LOT BECAUSE IT'S INTEGRATED IN WITH THE MULTIPLE SEQUENCE ALIGNMENT.
YOU DON'T HAVE TO TAKE YOUR DATA OFF TO ANOTHER PROGRAM TO MAKE THOSE TREES.
BUT THIS WILL ONLY TAKE YOU SO FAR. SOME OF THE MORE ADVANCED TREE-BUILDING METHODS HAVE MANY
MORE OPTIONS AVAILABLE TO YOU WHERE YOU CAN FINE TUNE WHAT THESE ALIGNMENTS LOOK LIKE
AND DICTATE HOW MANY TIMES IT SHOULD RECALCULATE AND SO ON.
IN THE LAST TWO MINUTES: ONCE MORE, FURTHER READING. THERE'S AN ENTIRE UNIT IN
CPBI ON CLUSTALW, WITH EXAMPLES FOR YOU TO WORK THROUGH.
THERE'S ANOTHER METHOD THAT I WANT TO POINT YOUR ATTENTION TO CALLED T-COFFEE.
EVERYTHING WE DID IN THE EXAMPLE JUST NOW RELIED ON SEQUENCE DATA TO CONSTRUCT THE ALIGNMENT.
WHAT T-COFFEE WILL DO IS IF THERE IS A STRUCTURE AVAILABLE FOR ANY OF THE SEQUENCES IN YOUR
SET, IT WILL USE THAT INFORMATION TO BETTER CONSTRUCT THE ALIGNMENT.
AGAIN, USING THAT THREE DIMENSIONAL INFORMATION TO GIVE YOU A BETTER ALIGNMENT AT THE END
OF THE DAY. SO, WITH ALL OF THAT, HOPEFULLY NOW OVER THE
LAST THREE HOURS, YOU STARTED TO GAIN SOME APPRECIATION FOR WHY SOME OF THE BASIC UNDERSTANDING
OF WHAT UNDERLIES THESE METHODS IS IMPORTANT. I KNOW MANY OF YOU HAVE USED THE METHODS BEFORE
BUT HAVE PROBABLY JUST, AS I MENTIONED IN THE VERY BEGINNING OF THE FIRST LECTURE STUCK
THE SEQUENCE IN THE BOX, CLICKED ON BLAST OR GO, AND GOTTEN YOUR RESULTS BACK.
THERE ARE THINGS THAT YOU NEED TO BE MINDFUL OF TO MAKE SURE YOU USE THESE METHODS TO THEIR
BEST ADVANTAGE. HOPEFULLY I'VE GIVEN YOU HINTS TO TAKE AWAY
TO USE THEM TO YOUR BEST ADVANTAGE AND, MORE IMPORTANTLY, TO AVOID SOME OF THE PITFALLS
THAT OTHERS HAVE FALLEN IN TO. IT DOESN'T REQUIRE A HARD CORE UNDERSTANDING
OF THE TECHNIQUES. WE DIDN'T GO IN TO ANY MATH.
I'VE SHOWN YOU EQUATIONS HERE AND THERE BUT WE HAVEN'T DISCUSSED THEM.
IT JUST REQUIRES THAT YOU HAVE A GENERAL UNDERSTANDING OF HOW THEY WORK SO YOU MAKE THE BEST CHOICES.
SO THAT YOU DON'T TREAT THIS AS THE BLACK BOX WHERE SEQUENCE COMES IN, RESULTS COME
OUT AND YOU JUST TRUST THEM. THE THING I'M TELLING YOU IS: USE THEM
IN AN INTELLIGENT FASHION, AND INSPECT THE RESULTS TO MAKE SURE THAT, BASED ON WHAT I'VE
TOLD YOU OVER THE LAST TWO LECTURES, THEY MAKE
SENSE. WHEN YOU PUT THOSE TWO THINGS TOGETHER THAT
WILL ALWAYS SERVE YOU INCREDIBLY WELL. I LEAVE YOU WITH THAT.
REMINDER ABOUT NEXT WEEK'S LECTURE. WE'RE GOING TO MOVE TO A MORE DECIDEDLY GENOMIC
POINT OF VIEW NEXT WEEK WHERE WILL GIVE YOU INFORMATION AND GIVING YOU VERY NICE OVERVIEW
OF HOW TO USE THE VARIOUS GENOME BROWSERS TO MINE GENOMIC DATA.
I'LL BE HAPPY TO TAKE ANY QUESTIONS. THANKS ONCE AGAIN FOR COMING.