Teach Computer Science
The Database Concept
Candidates should be able to:
- describe a database as a persistent organised store of data
- explain the use of data handling software to create, maintain and interrogate a database.
What is a database?
A database is a persistent, organised store of related data.
- A database is persistent because the data and structures are stored in secondary storage, even when the applications that use the data are no longer running.
- A database is organised because the data is stored in a very structured way, using tables, records and fields so that users and data handling applications can easily add, delete, edit, search and manipulate the data.
- A database is made up of related data because the individual items of data have a connection of some sort.
An address book, an encyclopaedia and a telephone directory are examples of paper-based manual databases. However, it is more common to talk about computerised databases. Computerised databases have several advantages over manual databases. These include:
- the ability for the data to be accessed by more than one person at the same time
- the ability to interrogate or query the data and view the resulting answers
- the ability for changes to the data to be made quickly available to all end users
- the reduction of errors in repetitive tasks due to the processing accuracy of data handling software
- the output of data in a range of different formats to suit user needs (e.g. graphs, reports, forms, etc.), either for viewing on screen or as print-outs
A computerised database is a collection of related data stored in one or more computerised files in a manner that can be accessed by users or computer programs.
Most computerised databases are operational databases, meaning that data going into the database is used in real time to support the ongoing activities of a business. A supermarket accounting system is an example: as items are sold, the inventory database is updated and the inventory information is made available to the sales staff.
Computers have the ability to store large amounts of data in a compact space and to process it speedily. Organisations of all sizes use databases to store, sort, interrogate and manage their data. Below are a few examples:
Hospital databases maintain details of patients, doctors and treatments.
The databases manage and co-ordinate admissions, consultations, treatments, staffing and stock control.
Businesses use databases to keep track of sales, stock and staff, etc. and to analyse their own performance.
Databases also help businesses to monitor trends in customers’ purchases. This helps businesses identify market opportunities.
Internet search engines, such as Google, Bing and Yahoo, all have powerful databases behind the scenes that hold the details of the websites returned in searches.
What is data handling software?
Any software designed to create, maintain and interrogate computerised databases is termed data handling software. Data handling software can therefore range from a simple program that creates and maintains a specific comma-delimited flat-file database through to sophisticated relational database management systems that can be used to create and manage a huge variety of database structures.
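As a rough illustration of the simple end of that range, the sketch below treats a piece of comma-delimited text as a flat-file database and uses Python's standard csv module to create, maintain and interrogate it. The student names, the field names and the helper functions are invented for the example; a real program would read from and write to a file on disk.

```python
import csv
import io

FIELDS = ["StudentID", "Forename", "Surname"]

# Hypothetical flat-file contents: one record per line, fields separated by commas.
FLAT_FILE = """StudentID,Forename,Surname
1,Ada,Lovelace
2,Alan,Turing
"""

def load_records(text):
    """Read every record from comma-delimited text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def save_records(records):
    """Write the records back out as comma-delimited text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()

def find_by_surname(records, surname):
    """Interrogate: return the records whose Surname field matches."""
    return [r for r in records if r["Surname"] == surname]

records = load_records(FLAT_FILE)
# Maintain: add a new record to the flat file.
records.append({"StudentID": "3", "Forename": "Grace", "Surname": "Hopper"})
print(find_by_surname(records, "Hopper"))
print(save_records(records))
```

Every search here scans the whole list, which is fine for a handful of records but is one of the reasons larger systems move to a database management system.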
How is data management software used to create a database?
Database creation involves using software to define and build the structures to hold the data. In a database file the data is structured in a particular way.
- A single item of data is stored in a named FIELD
- A complete set of fields makes up a RECORD; the KEY FIELD contains data unique to that record
- All the records on one ENTITY are stored in a TABLE
- One or more tables then make up the database FILE
Database creation involves the following steps:
- Each field would be created, selecting a data type to match the data to be stored.
- An existing field is set as the key field or a field is created for this purpose.
- Once the complete set of fields has been created and any validation rules added, they are saved as a table.
- Data is then entered into the database fields, each complete set of fields forming a single record with a unique entry in the KEY FIELD.
For example, in a database of students:
- A TABLE would store all the data on all the students
- An individual RECORD would store the data on a single student
- Several FIELDS would store the data (attributes) of the student such as Student ID, Forename, Surname etc.
- A KEY FIELD such as ‘StudentID’ can store a unique number to identify that student.
This database FILE would contain just one table and is known as a flat-file database. There are a number of limitations to such databases, and a relational database, which contains multiple linked tables, offers many advantages.
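The creation steps above can be sketched with Python's built-in sqlite3 module. This is only one possible way to do it: the table name Student, the sample students and the choice of SQLite are assumptions made for the example, and an in-memory database is used so the sketch leaves nothing behind (passing a file name instead would make the data persistent).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, discarded on exit

# Create each field with a data type; StudentID is set as the key field.
conn.execute("""
    CREATE TABLE Student (
        StudentID INTEGER PRIMARY KEY,
        Forename  TEXT NOT NULL,
        Surname   TEXT NOT NULL
    )
""")

# Each tuple is one record; every record needs a unique key field value.
conn.executemany(
    "INSERT INTO Student (StudentID, Forename, Surname) VALUES (?, ?, ?)",
    [(1, "Ada", "Lovelace"), (2, "Alan", "Turing")],
)

# A second record that reuses an existing StudentID is rejected.
try:
    conn.execute("INSERT INTO Student VALUES (1, 'Grace', 'Hopper')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

rows = conn.execute("SELECT * FROM Student ORDER BY StudentID").fetchall()
print(rows, duplicate_rejected)
```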
How is data management software used to maintain a database?
Database maintenance involves the following:
- adding (also referred to as inserting) new data records (for example, when a new member of staff joins a company or a new product is added to the stock in a warehouse).
- deleting existing data records (for example, when a member of staff leaves a company or a product is discontinued from the stock in a warehouse).
- updating (editing) existing data items within existing data records (for example, when a member of staff changes their name or a product has a price change).
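The three maintenance operations map directly onto the SQL statements INSERT, DELETE and UPDATE. The following is a minimal sketch using Python's sqlite3 module; the Product table and its contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Product (ProductID INTEGER PRIMARY KEY, Name TEXT, Price REAL)"
)

# Adding (inserting): two new products join the stock list.
conn.execute("INSERT INTO Product VALUES (1, 'Kettle', 19.99)")
conn.execute("INSERT INTO Product VALUES (2, 'Toaster', 24.50)")

# Updating (editing): the kettle has a price change.
conn.execute("UPDATE Product SET Price = ? WHERE ProductID = ?", (17.99, 1))

# Deleting: the toaster is discontinued.
conn.execute("DELETE FROM Product WHERE ProductID = ?", (2,))

products = conn.execute("SELECT ProductID, Name, Price FROM Product").fetchall()
print(products)
```

Using the key field in the WHERE clause ensures that only the intended record is changed or removed.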
How is the data in a data file actually stored?
The data in a database can be physically stored in different ways, each offering particular advantages and disadvantages.
Serial data files
In a serial data file each record is stored in series, one after the other, and there is no particular order to the records.
In this type of file structure the computer has to read through the data record-by-record until it finds the record that it needs to access. This makes accessing data from a serial file relatively slow.
If a record is deleted or edited then the complete altered file is re-written back to the storage medium which is relatively slow and may involve writing to a temporary file until the process is completed. The original file is then replaced by the altered temporary file.
Sequential data files
In a sequential data file the records are still stored in series, but they are kept in order, sorted on one of the fields in each record (for example, the key field). This makes it much easier to locate a particular record using an algorithm such as a binary search.
An alternative is an indexed sequential file. Here the position of each record is stored in an index, which is a separate sub-file. This allows the computer to quickly access any record by looking it up in the index first and then going directly to the correct location.
- Serial data files are slow to access particular records within the file.
- Sequential data files allow faster access to particular records, either using the fact that the data is sorted or indexed.
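The difference between the three access methods can be sketched in a few lines of Python. Lists stand in for the files and a dictionary stands in for the index sub-file; the record keys and names are invented, and each function also counts how many records it had to read.

```python
# Each record is a (key, data) pair.
serial_file = [(42, "Patel"), (7, "Jones"), (19, "Smith"), (3, "Evans")]
sequential_file = sorted(serial_file)  # same records, kept in key order

def serial_search(records, key):
    """Read record-by-record from the start until the key is found."""
    reads = 0
    for record_key, data in records:
        reads += 1
        if record_key == key:
            return data, reads
    return None, reads

def binary_search(records, key):
    """Repeatedly halve the search range; only valid on sorted records."""
    low, high, reads = 0, len(records) - 1, 0
    while low <= high:
        mid = (low + high) // 2
        reads += 1
        mid_key, data = records[mid]
        if mid_key == key:
            return data, reads
        if mid_key < key:
            low = mid + 1
        else:
            high = mid - 1
    return None, reads

# Index for an indexed sequential file: key -> position of the record.
index = {key: position for position, (key, _) in enumerate(sequential_file)}

def indexed_lookup(records, index, key):
    """Look the key up in the index, then go straight to that position."""
    position = index.get(key)
    return None if position is None else records[position][1]

print(serial_search(serial_file, 3))        # has to read every record
print(binary_search(sequential_file, 19))   # needs far fewer reads
print(indexed_lookup(sequential_file, index, 42))
```

With only four records the saving is small, but the number of reads for a serial search grows in step with the file size, while a binary search grows only with its logarithm.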
How does data management software interrogate a database?
Database interrogation involves using the database management software to query (search) the database for information.
There are many reasons why users may wish to query data, including:
- To identify a group of records that share a certain attribute – e.g. a list of students with nut allergies, products from a particular supplier, etc.
- To calculate totals based on the information held in records – e.g. calculating the total value of the assets held by a company.
- To update the details of a specific record or group of records
The above list shows that queries are a means of producing information from data. This information is used by the decision makers in organisations to plan strategies and tactics. Databases usually allow users to create, save and then reuse queries.
A query design specifies which records the user is searching for and what fields to display out of those records. There are two types of query:
- Simple query – looking for data in one field only (for example, a user of a car showroom database could run a simple query to find out how many Toyota cars there are in stock).
- Complex query – looking for data in multiple fields (for example a user of a car showroom database could run a complex query to find all hatchback or saloon Toyota cars registered between August 2006 and July 2010 but not with 5 doors).
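Both kinds of query can be sketched in SQL, run here through Python's sqlite3 module. The Car table, its field names and the five sample cars are assumptions made for the example; registration dates are stored as ISO-format text so that they compare correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Car (
        CarID      INTEGER PRIMARY KEY,
        Make       TEXT,
        BodyStyle  TEXT,
        Doors      INTEGER,
        Registered TEXT  -- ISO date, e.g. 2008-03-14
    )
""")
conn.executemany(
    "INSERT INTO Car VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Toyota", "hatchback", 3, "2008-03-14"),
        (2, "Toyota", "saloon", 5, "2009-06-01"),
        (3, "Toyota", "estate", 5, "2007-01-20"),
        (4, "Ford", "hatchback", 3, "2008-09-09"),
        (5, "Toyota", "saloon", 4, "2011-02-02"),
    ],
)

# Simple query: one field only (how many Toyota cars are in stock?).
toyota_count = conn.execute(
    "SELECT COUNT(*) FROM Car WHERE Make = 'Toyota'"
).fetchone()[0]

# Complex query: several fields combined with AND, OR (via IN) and NOT.
matches = conn.execute("""
    SELECT CarID FROM Car
    WHERE Make = 'Toyota'
      AND BodyStyle IN ('hatchback', 'saloon')
      AND Registered BETWEEN '2006-08-01' AND '2010-07-31'
      AND NOT Doors = 5
""").fetchall()

print(toyota_count, matches)
```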
What is a Database? Definition, Types and Components
Data is information, and to organize this data you need a database. This article on What is a Database will help you understand the definition, the different types, and their advantages and disadvantages.
The following topics are covered:
- What is Data?
- What is a Database?
- Facts about Database
- Database Components
- What are the Types of Databases
- Database Management System (DBMS)
- What is SQL?
- Advantages
- Disadvantages
So, let’s begin!
What is Data?
Data is a collection of distinct units of information. It comes in a variety of forms, such as text, numbers and media. In computing terms, data is information that has been translated into a form that is efficient to move and process.
Example : Name, age, weight, height, etc.
Now, let’s move on to the next topic and understand what is a Database.
What is a Database?
In layman's terms, consider your school register. All the details of the students are entered in a single file, and you can look up the details of any student in that file. This is a database: an organized store from which you can access the information of any student.
Facts about Database:
- Databases have evolved dramatically since their inception in the early 1960s.
- Navigational databases, such as the hierarchical database and the network database, were the original systems used to store and manipulate data, although these early systems were inflexible.
- In the early 1980s, relational databases became very popular, followed later by object-oriented databases.
- More recently, NoSQL databases emerged as a response to the growth of the internet and the need for faster processing of unstructured data.
- Today, we have cloud databases and self-driving databases that are breaking new ground in how data is collected, stored, managed, and utilized.
Note: Data is interchangeable.
Let’s see how to create a Database.
How to Create a database?
We use the CREATE DATABASE statement, followed by the name of the database, to create a new database.
For example, the statement CREATE DATABASE College; creates a database named College.
This is how simple it is to create a database.
The major components of the Database are:
- Hardware
This consists of a set of physical electronic devices such as I/O devices, storage devices and many more. It also provides an interface between computers and real-world systems.
- Software
This is the set of programs used to control and manage the overall database. It includes the DBMS software itself, the operating system, the network software used to share the data among users, and the application programs used to access data in the DBMS.
- Data
The Database Management System collects, stores, processes, and accesses data. The database holds both the actual (operational) data and the metadata.
- Procedures
These are the rules and instructions on how to use the database: how to design and run the DBMS, and guidance for the users who operate and manage it.
- Database Access Language
This is used to move data to and from the database: entering new data, updating existing data and retrieving required data. You write a set of appropriate commands in the database access language and submit them to the DBMS, which processes them and displays the results in a user-readable form.
Now that you guys have understood how to create a database, let’s move ahead and understand the types.
What are the Types of Databases
There are a few types that are very important and popular.
- Relational Database
- Object-Oriented Database
- Distributed Database
- NoSQL Database
- Graph Database
- Cloud Database
- Centralized Database
- Operational Database
These are the major types of Databases available. Now, let’s move on to the next topic.
Database Management System (DBMS)
A Database Management System (DBMS) is software that is used to manage the database. It receives instructions from a Database Administrator (DBA) and instructs the system to make the corresponding changes. These commands are used to load, retrieve or modify data in the system.
A database typically requires a comprehensive Database software program known as a Database Management System (DBMS). A DBMS basically serves as an interface between the database and its end-users or programs, allowing users to retrieve, update, and manage how the information is organized and optimized. A DBMS also facilitates oversight and control of databases, enabling a variety of administrative operations such as performance monitoring, tuning, and backup and recovery.
What is SQL?
Structured Query Language (SQL), pronounced “S-Q-L” or sometimes “See-Quel”, is the standard language for dealing with relational databases.
It is used to insert, search, update and delete database records. That does not mean SQL cannot do more than that; in fact, it can do a lot of other things as well. SQL is regularly used not only by database administrators but also by developers writing data integration scripts and by data analysts.
Now that you guys have understood what is SQL, let’s move on and understand the advantages of using the Database.
Advantages
- Reduced data redundancy.
- Fewer updating errors and increased consistency.
- Greater data integrity and independence from application programs.
- Improved data access to users through the use of host and query languages.
- Improved data security.
- Reduced data entry, storage, and retrieval costs.
Disadvantages
- Complexity: Databases are complex hardware and software systems.
- Cost: It requires significant upfront and ongoing financial resources.
- Security: Most leading companies need to know that their Database systems can securely store data, including sensitive employee and customer information.
- Compatibility: There is a risk that a DBMS might not be compatible with a company’s operational requirements.
With this, we come to the end of this article on “What is a Database”. I hope you enjoyed reading it.
A database is intended to hold a collection of data stored together to serve as many applications as possible. Hence a database is often conceived of as a repository of the information needed to run certain functions in a corporation or organization. Such a database permits not only the retrieval of data but also the continuous modification of data needed for the control of operations. It can also be searched to obtain answers to queries or information for planning purposes.
Purpose of Database
A database should be a repository of the data needed for an organization's data processing. That data should be accurate, private, and protected from damage. It should be organized so that diverse applications with different data requirements can employ the data. Different application programmers and various end-users have different views of the data, which must be derived from a common overall data structure. Their methods of searching and accessing data will be different.
Advantages of Using Database
- A database minimizes data redundancy to a great extent.
- A database can control data inconsistency to a large extent.
- A database makes it possible to share data.
- A database enforces standards.
- A database can ensure data security.
- Data integrity can be managed using the database.
Various Levels of Database Implementation
The database is implemented through three general levels. These levels are:
- Internal Level or Physical level
- Conceptual Level
- External Level or View Level
The Concept of Data Independence
As the database may be viewed through three levels of abstraction, a change at any one level can affect the schemas at the other levels. Because a database keeps growing, such changes may be frequent. This should not lead to redesigning and re-implementing the database. The concept of data independence proves beneficial in this context. There are two types of data independence:
- Physical data independence
- Logical data independence
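Logical data independence can be illustrated, loosely, with a view. In the sketch below a view stands in for an external-level schema and the underlying table for the conceptual level; this mapping, the table and view names, and the use of SQLite through Python's sqlite3 module are all assumptions made for the example. The conceptual schema is changed by adding a column, and the query against the view keeps working unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Conceptual level: the full Employee table.
conn.execute(
    "CREATE TABLE Employee (EmpID INTEGER PRIMARY KEY, Name TEXT, Salary REAL)"
)
conn.execute("INSERT INTO Employee VALUES (1, 'Asha', 30000)")

# External level: a view exposing only what one group of users needs.
conn.execute("CREATE VIEW StaffDirectory AS SELECT EmpID, Name FROM Employee")
before = conn.execute("SELECT * FROM StaffDirectory").fetchall()

# The conceptual schema changes: a new column is added to the table.
conn.execute("ALTER TABLE Employee ADD COLUMN Dept TEXT")

# The same query on the view still works and returns the same result.
after = conn.execute("SELECT * FROM StaffDirectory").fetchall()
print(before, after)
```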
Basic Terminologies Related to Database and SQL
Relation: In general, a relation is a table, i.e., data is arranged in rows and columns. A relation has the following properties:
- In any given column of a table, all the items are of the same kind, whereas items in different columns may not be of the same kind.
- In each row, every column must hold a single atomic value; a column cannot hold more than one value in the same row.
- All rows of a relation are distinct.
- The ordering of rows in a relation is immaterial.
- The columns of a relation are assigned distinct names, and the ordering of these columns is immaterial.
Tuple: The rows of a relation are generally termed tuples.
Attributes: The columns or fields of a table are termed attributes.
Degree: The number of attributes in a relation determines the degree of the relation. A relation having three attributes is said to be a relation of degree 3.
Cardinality: The number of tuples or rows in a relation is termed cardinality.
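These terms can be checked against a small table. The sketch below uses Python's sqlite3 module with an invented Employee relation; the degree is read from the column descriptions returned with the query, and the cardinality is the number of rows fetched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee (EmpID INTEGER PRIMARY KEY, Name TEXT, Dept TEXT)"
)
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [(1, "Asha", "Sales"), (2, "Ben", "HR"), (3, "Chen", "Sales"), (4, "Dina", "IT")],
)

cursor = conn.execute("SELECT * FROM Employee")
attributes = [column[0] for column in cursor.description]  # column names
tuples = cursor.fetchall()                                 # the rows

degree = len(attributes)   # number of attributes (columns)
cardinality = len(tuples)  # number of tuples (rows)
print(attributes, degree, cardinality)
```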
Database Basics: Concepts & Examples for Beginners
Get started with relational databases by understanding organized data storage, an overview of management and analytics, and how it all relates to spreadsheets!
Let's get started...
The importance of well-presented data cannot be overstated in today’s digitally advanced landscape. Companies around the world are basing their entire strategies on data so they can understand their customers well. Facebook, Amazon, Netflix, and Google are just some of the large corporations whose business model revolves around providing personalized recommendations to their users. This has been made possible only through organized data.
So, what is organized data?
Organized data can be any representation of data that allows you to gather insights. What’s more necessary is that it should be relevant to your department. If you work at an insurance firm, you’ll want to have information that includes customer credit history, age, bank records, etc. What you won’t be concerned with is their favorite TV show or what books they like to read.
All data is powerful; you just have to make sure that you’re dealing with something that concerns your end goal.
Sometimes you will need to tackle multiple datasets together to form useful insights. When multiple datasets are concerned, things can get complicated very easily and it can become time-consuming to constantly move back and forth between heaps of data.
Database Basics: What is a Database?
The most efficient way to store data is with the help of a database. A database is made up of tables that contain columns and rows. Each category is given its own table. For example, a company may have a table for customer information and another for sales numbers. You can think of a table as somewhat like a spreadsheet: inside a spreadsheet there are columns and rows of data. In a database, however, each row is called a record and each cell is called a field.
When people talk about a database they are usually referring to a relational database. This is one of the longest-established types of database and has been in use for over 40 years.
A relational database consists of 3 high-level components:
- Tables
- Keys
- Relationships
With these components, you can link even zettabytes (1,000,000,000,000,000,000,000 bytes) of data into something meaningful that can be traversed at will to see whatever you need.
Do you want to look at a specific segment of your data? Easily doable. Do you want to look at ONE particular result from a set of millions? No problem. How about looking at those 27 anomalies in your data that could be interesting to observe? Relational databases will always have your back. The flexibility that comes from having a relational database is unparalleled. Nothing has come close to being as mainstream and useful as relational databases and for good reason.
Let’s now discuss each of those 3 components in detail to make sense of what they are.
Tables are the Microsoft Excel equivalent of a single spreadsheet. They can also be classified as standalone datasets. Tables are used to organize the most closely related data together. A very basic example of a table could be a dataset about people that contains a bunch of people’s names, job titles, manager numbers, hiring dates, salaries, and commissions.
This information would be stored in a column and row format. Rows and columns also happen to be the very foundation of a table.
Where columns are used to store different information about one person, rows store information about different people. With both of them paired together, it ends up becoming a table full of information. Let’s discuss both of them in more detail.
Columns are used to differentiate the information we have on a single observable entity. In a Table that contains information about people, the columns would be used to hold different information. If a Table, as mentioned above, contains people’s names, job titles, manager numbers, hiring dates, salaries, and commissions, then that table will have 6 columns plus a Primary Key column that we will discuss in later sections.
Each column can be set up to allow only a specific type of information to be entered into it. This aspect allows for much-needed data integrity. For example, a column about salary should only contain numbers, right? While that is true, the people operating the databases are humans and can therefore accidentally enter something else in it. To prevent this from happening, columns can be designed to accept only a specific type of information.
The same goes for an email column. Anything that does not end in the typical ‘@abc.com’ should not be allowed inside that column.
The customization that goes into a column is pretty much endless. There are many presets available and custom options too.
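One way to express rules like these is with CHECK constraints, sketched here with Python's sqlite3 module. The People table, the sample rows and the exact rules (a non-negative numeric salary, an address ending in '@abc.com') are assumptions for illustration; other database systems offer their own column types and validation options.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE People (
        PersonID INTEGER PRIMARY KEY,
        Name     TEXT NOT NULL,
        -- Salary must be stored as a number and cannot be negative.
        Salary   REAL NOT NULL
                 CHECK (typeof(Salary) IN ('integer', 'real') AND Salary >= 0),
        -- Email must have at least one character before '@abc.com'.
        Email    TEXT NOT NULL CHECK (Email LIKE '%_@abc.com')
    )
""")

def try_insert(row):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO People (Name, Salary, Email) VALUES (?, ?, ?)", row
        )
        return True
    except sqlite3.IntegrityError:
        return False

accepted = try_insert(("John Doe", 52000, "john@abc.com"))
bad_salary = try_insert(("Jane Roe", "lots", "jane@abc.com"))  # not a number
bad_email = try_insert(("Sam Poe", 41000, "sam@xyz.org"))      # wrong domain
print(accepted, bad_salary, bad_email)
```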
Each row of a table represents one observable entity. To put it simply, if the people table has 3 rows, it holds the data of 3 different people. Each row represents an individual person and the columns display their respective information.
Rows allow us to see individual entries in the table. Each row also contains a Primary Key that allows us to search for individual entries with ease.
Keys allow unique identification for all rows in the table. Without keys there would be no way to differentiate between entries that have identical information in their columns. Two people in a table can have the same names and birthdays and without a unique key, it will be hard to differentiate between them and can lead to unnecessary confusion.
Suppose you’re an HR person who has to send a termination letter to a guy named John Doe and a promotion letter to another person with the same name. Imagine if that gets mixed up, both receive the termination or promotion letter. Talk about a corporate nightmare, right?
There are two types of keys you should know: a primary key and a foreign key.
Primary Keys are how every row in the table is searchable. They can be a single column or a combination of columns that make up a unique identification number.
Foreign Keys are used to link tables together within a database. These links are called relationships.
Relationships
This is the part where the link between various tables starts to develop. Relationships allow a multitude of tables to contain different but related kinds of information, while at the same time maintaining readability and optimizing space.
Imagine a small company that has different sections and departments for its employees, such as an insurance fund, a daycare center, and an electronic attendance register.
While all this information may be useful, reading all of it together won’t be. If the HR department would like to see the insurance information of a specific employee, they will not be concerned with the use of the daycare center. In fact, it will only become harder to read with so many columns in place.
Storing the databases is also not a trivial issue, as they can demand a ton of space once they start to grow in size. For both storage and security reasons, it’s not optimal for every computer in the company to hold the entire database.
To tackle this, relationships are implemented between various tables. Relationships essentially allow splitting up of information into useful components that emphasize readability and efficiency. This also means that different departments will only have access to what they need and the rest will not be available on their computers.
Separating the important information
In the above small company example, the best way to form relationships would be to have 4 tables, one each to represent employees, the insurance fund, the daycare center, and the attendance register.
Now, the information is split quite nicely. The next step is to identify what information would be present in most areas. In this case, everything we split up would be used by employees so it makes sense to consider it as our main table.
Think about what sort of information an employee table should have from your own experience of working in an office. Every staff member has the basic name, address, phone, email, and age columns along with something known as an ID number. That ID number is unique to each worker and it also serves as the Primary Key of the Employee table.
Using the ID number, we can search for information on any employee we want, even if we have employees with the same name. If by some odd miracle all your staff have the same name and age, you can still tell them apart by their ID number. Does that mean we can use the Employee ID to somehow link to other tables? Absolutely.
Forming relations with other tables through Foreign Keys
The Employee ID column is the Primary Key of the Employee table, and the same column is also added to each of the other tables. In those tables it is called a Foreign Key, which simply means a key from outside. This Foreign Key lets us link all the other information to each worker. All the information we have in other tables is tied to specific people, therefore having the Employee ID in every other table makes perfect sense.
While many technicalities decide how relationships are made in complex systems, this is one of the most simple examples that you can find in any relational database book.
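A minimal sketch of this kind of link, again using Python's sqlite3 module. The insurance table, its columns, and the sample data are assumptions made for illustration. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON` has been run on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

conn.execute(
    "CREATE TABLE employee (employee_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
# employee_id here is a foreign key pointing back at the employee table.
conn.execute("""
    CREATE TABLE insurance (
        policy_id   INTEGER PRIMARY KEY,
        employee_id INTEGER NOT NULL REFERENCES employee (employee_id),
        provider    TEXT NOT NULL
    )
""")
conn.executemany("INSERT INTO employee VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
conn.executemany(
    "INSERT INTO insurance VALUES (?, ?, ?)",
    [(10, 1, "Acme Health"), (11, 2, "Acme Health")],
)

# The shared employee_id column lets us combine the two tables.
policies = conn.execute("""
    SELECT e.name, i.provider
    FROM employee AS e
    JOIN insurance AS i ON i.employee_id = e.employee_id
    ORDER BY e.employee_id
""").fetchall()


def add_policy(policy_id, employee_id, provider):
    """Try to insert a policy; return False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO insurance VALUES (?, ?, ?)",
            (policy_id, employee_id, provider),
        )
        return True
    except sqlite3.IntegrityError:
        return False


# There is no employee 99, so the database refuses this row.
orphan_accepted = add_policy(12, 99, "Acme Health")
```

The rejected insert shows the practical benefit of declaring the key: the database itself refuses insurance records that do not belong to a known employee.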
When discussing relational databases, some of the most common terms you’ll hear are SQL and synonyms for other properties. Columns may also be known as attributes, fields, or features. Rows may also be called records, entries, or tuples.
SQL is a query language that was designed to make databases easier to work with. The power of this tool cannot be overstated. With a strong grasp of its core concepts, you can do pretty much anything you want with the data when it comes to insights. Most commonly, SQL is used to extract (or query) data from the database. With this language, you can specify what data you want and what the output should look like. This is how you can take data from a database into Microsoft Excel or Google Sheets.
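To make this concrete, here is a small sketch in Python using the standard sqlite3 and csv modules. The query picks the columns, filter, and ordering, and the result is written as CSV text, a format that spreadsheet programs can generally open. The table and its rows are invented for the example.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (employee_id INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, "Ada", 36), (2, "Grace", 45), (3, "Linus", 28)],
)

# The query states which columns we want, which rows, and in what order.
cursor = conn.execute(
    "SELECT name, age FROM employee WHERE age > ? ORDER BY age DESC", (30,)
)

# Write a header row followed by the result rows as CSV text.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow([column[0] for column in cursor.description])
writer.writerows(cursor)
csv_text = buffer.getvalue()
```

In a real workflow the text would be written to a file rather than kept in memory; the in-memory buffer just keeps the sketch self-contained.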
How are relational databases different from Excel/Google Spreadsheets?
On a very small scale, spreadsheet programs can work well. But the minute you start to think about scalability, security, and usability, spreadsheets no longer suffice.
The first problem is making the data available to more than one person. Sharing an online spreadsheet with a few people may work, but it breaks down when many different departments need to look into it every day. The chances of two people working on the same thing at once are not negligible, and conflicting edits can lead to serious problems. If you’ve ever worked with another person on Google Docs at the same time, chances are you’ve experienced small hiccups.
Data is precious and giving all employees access to it can result in disaster. Not all employees need all the data and some of it needs to be kept confidential. For example, Amazon has a lot of sensitive customer data: addresses, phone numbers, credit card information, etc. They have over 50,000 corporate employees (over 700,000 corporate and non-corporate), but most do not have access to this information. Amazon uses databases to restrict access and protect customers.
If you have a ton of data but can’t find an efficient way to get insights from it, it’s pretty much useless. Tools other than relational databases simply cannot extract meaningful information the way SQL and other database languages can.
As your data increases, you might be forced to shift to a relational database. Spreadsheets can only handle so much data. Google Spreadsheets has a limit of 5 million cells and in today’s time, that really isn’t a lot.
It won't take long for that to be filled up, and migrating 5 million cells to a database is going to be troublesome. It’s recommended you start your migration the moment you observe growth. Databases can handle as much as you can throw at them, and this is why you see them implemented anywhere data exists.
The flexibility that comes from having a relational database is unparalleled. Nothing has come close to being as mainstream and useful as relational databases and for good reason.
To recap, we explored why organized data is important and how you can organize data with a relational database. A relational database consists of multiple data tables linked together through keys and relationships.
Tables, keys, and relationships are the three core components of a relational database. Tables are made up of rows and columns. Rows represent individual entities in a table, while columns represent their attributes. Keys (primary and foreign) are central to what makes relational databases work. Relationships between tables are the links that make the data much more meaningful. They explain how things are actually connected and what connects them.
Without keys and relationships linking tables together, there is no significant difference between multiple spreadsheets and a relational database.
Finally, we reviewed common database jargon that you should get familiar with. It’s mostly synonyms of other things, but it can help to know what’s being discussed.
Important Questions and Notes
Database Concepts Class 11 Notes Important Points
Manual Record Keeping System :
A system where records are maintained by hand, without using a computer system.
Advantages of Manual Record Keeping System :
- It is less expensive.
- Less risk of data loss.
- No software specialist is required.
Disadvantages of Manual Record Keeping System :
- No sharing of data.
- More chances of inconsistent data.
- Making corrections is very time consuming.
Electronic Record Keeping System :
A system in which records are maintained on a computer system instead of on paper.
Advantages of Electronic Record Keeping System :
- Less paper wastage.
- Searching for a record is very simple.
- Easy to backup the documents.
Disadvantages of Electronic Record Keeping System :
- More expensive.
- More risk of data loss.
- A software specialist is required to manage this system.
Database Management System :
A database management system (DBMS) is software that can be used to create and manage databases. Some examples of open source and commercial DBMS include MySQL, Oracle, PostgreSQL, SQL Server, Microsoft Access, MongoDB etc.
Databases are widely used in various fields. Some applications are given below :
Common Terms used in DBMS :
Attributes : The columns of a relation are its attributes, which are also referred to as fields. For example, in the table “Student” given below, there are four attributes.
Tuple : Each row of data in a relation (table) is called a tuple. It is also known as a record. For example, in the table “Student” given above, there are two tuples.
Domain : It is a set of values from which an attribute can take a value in each row. Usually, a data type is used to specify domain for an attribute. For example, in “Student” relation given above, the attribute Roll_no takes integer values and hence its domain is a set of integer values.
Degree : The number of attributes in a relation is called the Degree of the relation. For example, the relation “Student” given below with four attributes is a relation of degree 4.
Cardinality : The number of tuples in a relation is called the Cardinality of the relation. For example, the cardinality of relation “Student” is 2 as there are 2 tuples in the table.
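The terms above can be checked against a small table. The sketch below uses Python's sqlite3 module; only the Roll_no attribute is named in these notes, so the other three column names and both sample rows are assumptions chosen to give four attributes and two tuples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical "Student" relation with four attributes (columns).
conn.execute("""
    CREATE TABLE student (
        roll_no INTEGER PRIMARY KEY,
        name    TEXT,
        section TEXT,
        marks   INTEGER
    )
""")

# Two tuples (rows).
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?, ?)",
    [(1, "Asha", "A", 72), (2, "Ravi", "B", 65)],
)

cursor = conn.execute("SELECT * FROM student")
attribute_names = [column[0] for column in cursor.description]

degree = len(attribute_names)  # number of attributes
cardinality = conn.execute("SELECT COUNT(*) FROM student").fetchone()[0]  # number of tuples
```

Adding a row changes the cardinality but not the degree; adding a column does the opposite.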
Key Concepts in DBMS :
Database Schema : It is the skeleton of the database that represents the structure (table names and their fields/columns), the type of data each column can hold, constraints on the data to be stored (if any), and the relationships among the tables.
Data Constraint : Certain restrictions or limitations on the type of data that can be inserted in one or more columns of a table during table creation is called data constraint. Constraints are used to ensure accuracy and reliability of data in the database.
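As an illustration of data constraints, the sketch below declares a few common ones (NOT NULL, UNIQUE, CHECK) using Python's sqlite3 module and shows that rows violating them are refused. The table layout and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE student (
        roll_no INTEGER PRIMARY KEY,                     -- must be unique
        name    TEXT NOT NULL,                           -- must be present
        email   TEXT UNIQUE,                             -- no two rows may share it
        marks   INTEGER CHECK (marks BETWEEN 0 AND 100)  -- must be in range
    )
""")


def try_insert(row):
    """Insert one row and report whether the DBMS accepted it."""
    try:
        conn.execute("INSERT INTO student VALUES (?, ?, ?, ?)", row)
        return "accepted"
    except sqlite3.IntegrityError as error:
        return f"rejected: {error}"


results = [
    try_insert((1, "Asha", "asha@example.com", 72)),   # valid row
    try_insert((2, None, "ravi@example.com", 65)),     # name is missing
    try_insert((3, "Ravi", "asha@example.com", 65)),   # email already used
    try_insert((4, "Meera", "meera@example.com", 140)),  # marks out of range
]
```

Only the first row is stored; the other three are rejected by the DBMS before they reach the table.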
Meta-data or Data Dictionary : The database schema along with various constraints on the data is stored by DBMS in a database catalog or dictionary, called meta-data. A meta-data is data about the data.
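Most DBMSs let you query this catalog directly. The sketch below shows one way to do it in SQLite through Python's sqlite3 module, using the `sqlite_master` table and the `PRAGMA table_info` statement. Other systems expose their catalogs differently, and the student table here is only an example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# sqlite_master is SQLite's catalog: one row per table, index, view or trigger.
table_names = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]

# PRAGMA table_info returns one row per column:
# (position, name, declared type, not-null flag, default value, primary-key flag)
columns = [
    (row[1], row[2]) for row in conn.execute("PRAGMA table_info(student)")
]
```

Nothing has been inserted into the table at this point, which underlines that the catalog describes the structure of the data rather than the data itself.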
Database Instance : When we define database structure or schema, state of database is empty i.e. no data entry is there. After loading data, the state or snapshot of the database at any given time is the database instance.
Query : A query is a request to a database for obtaining information in a desired way. Query can be made to get data from one table or from a combination of tables.
Data Manipulation : Modification of a database consists of three operations, viz. Insertion, Deletion and Update. Insertion means adding a new record to a table. Deletion means removing an existing record from a table. Update means editing an existing record in a table.
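The three operations map onto the SQL statements INSERT, UPDATE and DELETE. A minimal sketch with Python's sqlite3 module follows; the table and values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name TEXT NOT NULL, marks INTEGER)"
)

# Insertion: add new records.
conn.execute("INSERT INTO student VALUES (?, ?, ?)", (1, "Asha", 72))
conn.execute("INSERT INTO student VALUES (?, ?, ?)", (2, "Ravi", 65))

# Update: edit an existing record. rowcount is the number of rows changed.
updated = conn.execute(
    "UPDATE student SET marks = ? WHERE roll_no = ?", (80, 2)
).rowcount

# Deletion: remove an existing record.
deleted = conn.execute("DELETE FROM student WHERE roll_no = ?", (1,)).rowcount

remaining = conn.execute("SELECT roll_no, name, marks FROM student").fetchall()
```

Both UPDATE and DELETE use a WHERE clause here; without one, they would apply to every row in the table.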
Database Engine : Database engine is the underlying component or set of programs used by a DBMS to create database and handle various queries for data retrieval and manipulation.
Three Important Properties of a Relation :
In the relational data model, the following three properties are observed with respect to a relation, and they make a relation different from a data file or a simple table.
Property-1 : imposes following rules on an attribute of the relation.
- Each attribute in a relation has a unique name.
- Sequence of attributes in a relation is immaterial.
Property-2 : imposes following rules on tuple of the relation.
- Each tuple in a relation is distinct.
- Sequence of tuples in a relation is immaterial.
Property-3 : imposes following rules on the state of a relation.
- All data values in an attribute must be from the same domain (same data type).
- Each data value associated with an attribute must be atomic.
- No attribute can have many data values in one tuple.
- A special value “NULL” is used to represent values that are unknown.
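One practical consequence of NULL meaning "unknown" is that it cannot be matched with the `=` operator; SQL provides `IS NULL` for that. The sketch below demonstrates this with Python's sqlite3 module, where Python's `None` is stored as NULL. The table and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name TEXT NOT NULL, phone TEXT)"
)
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(1, "Asha", "555-0100"), (2, "Ravi", None)],  # Ravi's phone is unknown
)

# Comparing with "= NULL" never matches, because the result of the
# comparison is itself unknown rather than true.
matched_with_equals = conn.execute(
    "SELECT COUNT(*) FROM student WHERE phone = NULL"
).fetchone()[0]

# IS NULL is the correct test for a missing value.
missing_phone = conn.execute(
    "SELECT name FROM student WHERE phone IS NULL"
).fetchall()
```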
Keys in Relational Database :
Candidate Key : Those fields which can act as a primary key in a table are called Candidate Keys.
Primary Key : A field which uniquely identifies each and every record in table is called primary key.
Composite Primary Key : If no single attribute in a relation is able to uniquely identify the tuples, then more than one attribute is taken together as the primary key. Such a primary key consisting of more than one attribute is called a Composite Primary Key.
Foreign Key : A foreign key is used to represent the relationship between two relations. A foreign key is an attribute whose value is derived from the primary key of another relation.
In some cases, foreign key can take NULL value if it is not the part of primary key of the foreign table. The relation in which the referenced primary key is defined is called primary relation or master relation.
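A composite primary key can be sketched as follows with Python's sqlite3 module. The result table is hypothetical: neither the roll number nor the subject is unique on its own, but the pair is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE result (
        roll_no INTEGER NOT NULL,
        subject TEXT NOT NULL,
        marks   INTEGER,
        PRIMARY KEY (roll_no, subject)  -- the two columns together form the key
    )
""")


def add_result(roll_no, subject, marks):
    """Insert a result; return False if the composite key already exists."""
    try:
        conn.execute("INSERT INTO result VALUES (?, ?, ?)", (roll_no, subject, marks))
        return True
    except sqlite3.IntegrityError:
        return False


outcomes = [
    add_result(1, "Maths", 72),    # new (roll_no, subject) pair
    add_result(1, "Physics", 65),  # same student, different subject
    add_result(2, "Maths", 80),    # same subject, different student
    add_result(1, "Maths", 90),    # duplicate pair, rejected
]
```

Repeating a roll number or a subject is fine; only a repeated combination of the two is refused.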
Introduction to Databases
This course is part of Meta Database Engineer Professional Certificate
Taught in English
Instructor: Taught by Meta Staff
You do not need prior database experience. Only basic internet navigation skills and an eagerness to get started with coding.
What you'll learn
Concepts and principles that underpin how databases work .
Plan and execute a simple database development project .
Skills you'll gain
- Database (DBMS)
- database administration
There are 5 modules in this course
In this course, you will be introduced to databases and explore the modern ways in which they are used. Learn to distinguish between different types of database management systems then practice basic creation and data selection with the use of Structured Query Language (SQL) commands.
By the end of this course, you’ll be able to:
- Demonstrate a working knowledge of the concepts and principles that underpin how databases work
- Identify and explain the different types of core technology and management systems used in databases
- Identify and interpret basic SQL statements and commands
- Manipulate records in a database with the use of SQL statements and commands
- Outline alternatives to SQL
- Plan and design a simple relational database system
You’ll also gain experience with the following:
- Fundamental concepts in databases
- Basic MySQL syntax and commands
- Database management systems
- MySQL software
- Relational databases
In this module, you’ll receive an introduction to the course and explore possible career roles that you could follow as a database engineer. You’ll also review some tips on how to take this course successfully and discuss what it is that you hope to learn. As part of your introduction, you’ll learn about the basics of databases and data and how they work. You’ll then receive an introduction to SQL, or Structured Query Language, the coding syntax used to interact with databases. Finally, you’ll explore the basic structure of databases and discover the different types of keys they use.
13 videos 10 readings 4 quizzes 1 discussion prompt
13 videos • Total 49 minutes
- Introduction to the program • 2 minutes • Preview module
- Introduction to databases • 3 minutes
- A day in the Life of a Database Engineer • 6 minutes
- What is a database? • 5 minutes
- How is data related? • 3 minutes
- Alternative types of databases • 4 minutes
- What is Structured Query Language? • 2 minutes
- SQL usage • 3 minutes
- Advantages of SQL • 2 minutes
- SQL syntax introduction • 4 minutes
- What are tables in databases? • 3 minutes
- Types of keys in a database table • 3 minutes
- Module summary: Introduction to Databases • 2 minutes
10 readings • Total 120 minutes
- Course syllabus: Introduction to databases • 10 minutes
- How to be successful in this course • 10 minutes
- Relational data example charts • 10 minutes
- Database Evolution • 15 minutes
- Additional resources • 10 minutes
- Common SQL Commands • 15 minutes
- Tables overview • 15 minutes
- Database structure overview • 15 minutes
4 quizzes • Total 75 minutes
- Knowledge check: Databases and data • 15 minutes
- Knowledge check: SQL syntax review • 15 minutes
- Knowledge check: Database structure • 15 minutes
- Module quiz: Introduction to Databases • 30 minutes
1 discussion prompt • Total 10 minutes
- What do you hope to learn? • 10 minutes
Create, Read, Update and Delete (CRUD) Operations
In this module, you’ll explore CRUD, or Create, Read, Update and Delete operations. You’ll begin with an exploration of SQL data types and learn how to differentiate between numeric data, string data and default values. You’ll also work through several exercises in which you’ll learn how to use these different data types within your database projects. You’ll then move on to learn how to Create and Read data within a database. You’ll discover how to create databases and tables and populate them with data using SQL statements. Lastly, you’ll explore the SQL statements used for updating and deleting data in a database. To demonstrate your ability with CRUD operations, you’ll complete exercises that will task you with creating and managing data.
12 videos 4 readings 10 quizzes 7 ungraded labs
12 videos • Total 41 minutes
- Numeric data types • 3 minutes • Preview module
- String data types • 3 minutes
- Default values • 4 minutes
- CREATE and DROP database • 2 minutes
- CREATE TABLE statement • 2 minutes
- ALTER TABLE statement • 3 minutes
- INSERT statement • 4 minutes
- SELECT statement • 3 minutes
- INSERT INTO SELECT statement • 3 minutes
- Updating data • 3 minutes
- Deleting data • 3 minutes
- Module summary: Create, Read, Update and Delete (CRUD) Operations • 3 minutes
4 readings • Total 45 minutes
- Additional resources • 15 minutes
- Creating tables • 15 minutes
- Additional resources • 5 minutes
10 quizzes • Total 156 minutes
- Self review: Working with numbers • 15 minutes
- Self review: Working with strings • 12 minutes
- Self review: Working with default values • 12 minutes
- Self review: Choosing the right data type for a column • 15 minutes
- Self-review: Create database, create table and insert data • 15 minutes
- Self review: Practicing table creation • 15 minutes
- Knowledge check: Create, insert and select • 15 minutes
- Self-review: Record deletion • 12 minutes
- Knowledge check: Update and Delete • 15 minutes
- Module quiz: Create, Read, Update and Delete (CRUD) Operations • 30 minutes
7 ungraded labs • Total 420 minutes
- Exercise: Working with numbers • 60 minutes
- Exercise: Working with strings • 60 minutes
- Working with default values • 60 minutes
- Choosing the right data type for a column • 60 minutes
- Exercise: Create Database, create table and insert data • 60 minutes
- Exercise: Practicing table creation • 60 minutes
- Exercise: Record deletion • 60 minutes
SQL Operators and sorting and filtering data
In this module, you’ll explore SQL operators and learn how to sort and filter data. You’ll begin this module with a lesson on SQL operators. As part of this first lesson, you’ll explore the syntax and process steps used to deploy SQL arithmetic and comparison operators within a database. Next, you’ll discover how to sort and filter data using clauses. The clauses that you’ll learn about include the Order By clause, Where clause and Select Distinct clause. In each lesson item, you’ll receive an overview of how each clause is used to sort and filter data in a database. You’ll also view demonstrations of these clauses and then receive an opportunity to try them for yourself.
7 videos 7 readings 3 quizzes 1 ungraded lab
7 videos • Total 34 minutes
- SQL Arithmetic Operators • 4 minutes • Preview module
- Operators in use • 4 minutes
- SQL Comparison operators • 5 minutes
- ORDER BY clause • 5 minutes
- WHERE clause • 7 minutes
- SELECT DISTINCT clause • 4 minutes
- Module summary: SQL operators and sorting and filtering data • 1 minute
7 readings • Total 155 minutes
- SQL Arithmetic Operator Examples • 30 minutes
- SQL Comparison operator examples • 30 minutes
- Types of ordering / sorting • 30 minutes
- WHERE Clause uses • 30 minutes
- SELECT DISTINCT clause in use • 15 minutes
3 quizzes • Total 72 minutes
- Knowledge Check: Operators • 30 minutes
- Self-review: ORDER BY and WHERE • 12 minutes
- Module quiz: SQL operators and sorting and filtering data • 30 minutes
1 ungraded lab • Total 60 minutes
- ORDER BY and WHERE • 60 minutes
In this module, you’ll learn about database design. In the first lesson, you’ll receive an overview of how to design a database schema. As part of this overview, you’ll learn about basic database design concepts like schema and find out about different types of schemas. The next lesson focuses on relational database design. In this lesson, you’ll explore how to establish relationships between tables in a database using keys. You’ll also learn about the different types of keys that are used in relational database design, such as primary keys and foreign keys.
12 videos 9 readings 6 quizzes 1 ungraded lab
12 videos • Total 47 minutes
- Database schema • 3 minutes • Preview module
- Schema in use • 4 minutes
- Types of database schema • 2 minutes
- Table relationships • 3 minutes
- Primary key • 2 minutes
- Foreign key • 4 minutes
- Finding entities • 3 minutes
- What is database normalization? • 4 minutes
- First normal form 1NF • 4 minutes
- Second normal form 2NF • 6 minutes
- Third normal form 3NF • 4 minutes
- Module summary: Database design • 2 minutes
9 readings • Total 140 minutes
- Exploring database schema • 15 minutes
- Building a schema • 15 minutes
- Relational model • 15 minutes
- Keys in depth • 15 minutes
- Entity relationship diagrams (ERD) • 30 minutes
- Data normalization • 15 minutes
6 quizzes • Total 96 minutes
- Knowledge check: Database schema • 15 minutes
- Knowledge check: Defining keys • 15 minutes
- Database relations and keys • 15 minutes
- Knowledge Check: Database normalization • 12 minutes
- Self-review: Database schema examples • 15 minutes
- Module quiz: Database design • 24 minutes
- Database schema examples • 60 minutes
In this module, you’ll have an opportunity to recap what you learned and identify your strengths as well as target topics that you would like to revisit in this course.
2 videos 2 readings 1 quiz 1 discussion prompt
2 videos • Total 5 minutes
- Course Recap: Introduction to databases • 2 minutes • Preview module
- Congratulations, you have completed Introduction to databases • 2 minutes
2 readings • Total 8 minutes
- About the final graded quiz assessment • 3 minutes
- Next steps after Introduction to Databases • 5 minutes
1 quiz • Total 30 minutes
- Final graded quiz: Intro to databases • 30 minutes
- What are your thoughts on working with databases? • 10 minutes
What is Data?
Data is facts and statistics, either stored or flowing over a network. It is generally raw and unprocessed. For example, when you visit a website, it might store your IP address; that is data. In return it might add a cookie to your browser, marking that you visited the website; that is data too. Your name is data, and so is your age.
Data becomes information when it is processed, turning it into something meaningful. Like, based on the cookie data saved on user's browser, if a website can analyse that generally men of age 20-25 visit us more, that is information, derived from the data collected.
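The step from data to information can be sketched in a few lines of Python. The visit records below are invented sample data, and the age bands are an arbitrary choice made for this example.

```python
from collections import Counter

# Raw data: one record per visit (hypothetical sample values).
visits = [
    {"ip": "203.0.113.5", "age": 22},
    {"ip": "203.0.113.9", "age": 24},
    {"ip": "198.51.100.7", "age": 31},
    {"ip": "198.51.100.8", "age": 21},
    {"ip": "192.0.2.14", "age": 17},
    {"ip": "192.0.2.30", "age": 23},
]


def age_band(age):
    """Place an age into one of three illustrative bands."""
    if age < 20:
        return "under 20"
    if age <= 25:
        return "20-25"
    return "over 25"


# Processing: count visits per age band.
visits_per_band = Counter(age_band(visit["age"]) for visit in visits)

# Information: which band visits most often, and how many times.
busiest_band, busiest_count = visits_per_band.most_common(1)[0]
```

The individual records say little on their own; the summary that one age band accounts for most visits is the information derived from them.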
What is a Database?
A Database is a collection of related data organised in a way that data can be easily accessed, managed and updated. Database can be software based or hardware based, with one sole purpose, storing data.
In the early days of computing, data was collected and stored on magnetic tapes, which could only be read sequentially, so reaching a particular record meant winding through everything stored before it. Tapes were slow and bulky, and computer scientists soon realised that they needed a better solution to this problem.
Larry Ellison , the co-founder of Oracle was amongst the first few, who realised the need for a software based Database Management System.
What is DBMS?
A DBMS is software that allows the creation, definition and manipulation of databases, allowing users to store, process and analyse data easily. A DBMS provides us with an interface or a tool to perform various operations like creating a database, storing data in it, updating data, creating tables in the database and a lot more.
DBMS also provides protection and security to the databases. It also maintains data consistency in case of multiple users.
Here are some examples of popular DBMS used these days:
- Amazon SimpleDB (cloud based) etc.
Characteristics of Database Management System
A database management system has following characteristics:
- Data stored into Tables: Data is never directly stored into the database. Data is stored into tables, created inside the database. DBMS also allows to have relationships between tables which makes the data more meaningful and connected. You can easily understand what type of data is stored where by looking at all the tables created in a database.
- Reduced Redundancy: In the modern world hard drives are very cheap, but earlier when hard drives were too expensive, unnecessary repetition of data in database was a big problem. But DBMS follows Normalisation which divides the data in such a way that repetition is minimum.
- Data Consistency: On live data, i.e. data that is being continuously updated and added, maintaining the consistency of data can become a challenge. But the DBMS handles it all by itself.
- Support for Multiple Users and Concurrent Access: A DBMS allows multiple users to work on it (update, insert, delete data) at the same time and still manages to maintain data consistency.
- Query Language: DBMS provides users with a simple Query language, using which data can be easily fetched, inserted, deleted and updated in a database.
- Security: The DBMS also takes care of the security of data, protecting the data from un-authorised access. In a typical DBMS, we can create user accounts with different access permissions, using which we can easily secure our data by restricting user access.
- Transactions: A DBMS supports transactions, which allow us to better handle and manage data integrity in real-world applications where multi-threading is extensively used.
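The transaction support mentioned in the list above can be illustrated with a small sketch using Python's sqlite3 module. The account table and the transfer function are hypothetical. Using the connection as a context manager commits the changes if the block succeeds and rolls all of them back if an exception is raised.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE account (
        name    TEXT PRIMARY KEY,
        balance INTEGER NOT NULL CHECK (balance >= 0)  -- no overdrafts allowed
    )
""")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(sender, receiver, amount):
    """Move money between accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commit on success, roll back on any exception
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances():
    return dict(conn.execute("SELECT name, balance FROM account"))


ok_transfer = transfer("alice", "bob", 30)       # succeeds
after_ok = balances()
failed_transfer = transfer("alice", "bob", 500)  # would overdraw alice
after_failed = balances()
```

In the failed transfer the credit to the receiver has already run when the debit violates the CHECK constraint, yet the rollback undoes both steps, so the balances are the same as before the attempt.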
Advantages of DBMS
- Segregation of the application program from the data.
- Minimal data duplication or data redundancy.
- Easy retrieval of data using the Query Language.
- Reduced development time and maintenance needs.
- With Cloud Datacenters, we now have Database Management Systems capable of storing almost infinite data.
- Seamless integration with application programming languages, which makes it much easier to add a database to almost any application or website.
Disadvantages of DBMS
- Its complexity.
- Commercially licensed DBMSs are generally costly, although open-source systems such as MySQL are free to use.
- They are large in size.
- Database Concepts Class 12 Notes
Teachers and Examiners ( CBSESkillEduction ) collaborated to create the Database Concepts Class 12 Notes. All the important information is taken from the NCERT Textbook Information Technology (802) class 12.
What is Database?
A database is an organized collection of data that is typically kept electronically in a computer system. A database is usually managed by a database management system (DBMS).
A database has the following properties:
1) A database is a representation of some aspect of the real world, also called the miniworld. Whenever there are changes in this miniworld, they are also reflected in the database.
2) It is designed, built and populated with data for a specific purpose.
3) It can be of any size and complexity.
4) It can be maintained manually or it may be computerized.
Need for a Database
Database management systems enable users to securely, efficiently, and quickly share data throughout an organization. A data management system offers quicker access to more accurate data by quickly responding to database requests.
Database approach – The database approach addresses the following problems of storing data in separate files:
Data Redundancy – Data redundancy is the storing of the same data across many files. Space would be wasted as a result of this.
Data Inconsistency – If a file is modified, all the files that contain comparable information must also be updated, or the data will become inconsistent.
Lack of Data Integration – Because data files are unique, it is very challenging to obtain information from various files.
Database Management System (DBMS)
Data is stored, retrieved, and analyzed using software called database management systems (DBMS). Users can create, read, update, and remove data in databases using a DBMS, which acts as an interface between them and the databases.
The various operations that need to be performed on a database are as follows –
1. Defining the Database – It involves specifying the data type of the data that will be stored in the database and any constraints on that data.
2. Populating the Database – It involves storing the data on a storage medium that is controlled by the DBMS.
3. Manipulating the Database – It involves modifying the database, retrieving data or querying the database, generating reports from the database, etc.
4. Sharing the Database – It allows multiple users to access the database at the same time.
5. Protecting the Database – It protects the database from software or hardware failures and from unauthorized access.
6. Maintaining the Database – It allows the database to be adapted easily to changing requirements.
Some examples of DBMS are MySQL, Oracle, DB2, IMS, IDS, etc.
Characteristics of Database Management Systems
Self-describing Nature of a Database System – A database system is said to be self-describing if, in addition to the data itself, it holds metadata that defines and describes the data and the relationships between the tables in the database.
Insulation Between Programs and Data – The description of the data is stored separately in the DBMS catalogue. Changes to the data's structure are made in the catalogue, so the programs that access the data do not need to be changed.
Sharing of Data – Multiple users can access the database. Therefore, a DBMS must have concurrency control software to provide concurrent access to the database’s data without encountering any consistency issues.
Types of Users of DBMS
Depending on their needs and how they interact with the DBMS, different types of users use the DBMS. Four main categories of users exist –
End Users – End users are those who use the database to perform queries, make changes and produce reports based on their requirements.
Database Administrator (DBA) – The DBA is in charge of authorising access to the database, monitoring how it is being used, offering technical support, and acquiring hardware and software resources.
Application Programmers – Application programmers write the application programs that communicate with the database. These programs are written in high-level languages and use SQL to communicate with the database.
System Analyst – A system analyst assesses the feasibility, technical and economic aspects of the database design.
Advantages of using DBMS Approach
Following are the advantages of using a DBMS –
- Reduction in Redundancy – All the data is stored at one place. There is no repetition of the same data. This also reduces the cost of storing data on hard disks or other memory devices.
- Improved Consistency – The chances of data inconsistencies in a database are also reduced as there is a single copy of data that is accessed or updated by all the users.
- Improved Availability – Same information is made available to different users. This helps sharing of information by various users of the database.
- Improved Security – The DBA can protect the database by using passwords and restricting users’ database access rights.
- User Friendly – Because of its user-friendly interface, it reduces users’ dependence on computer specialists to carry out various data-related actions in a DBMS.
Limitations of using DBMS Approach
- High Cost – The cost of implementing a DBMS system is very high. It is also a very time consuming process.
- Security and Recovery Overheads – Depending on the data stored, unauthorised access to a database can result in a threat to the individual or business. Additionally, regular data backups are necessary to guard against disasters like fires and earthquakes.
A relational database is a collection of data elements with pre-defined relationships between them. These elements are arranged in a set of tables with rows and columns. Tables are used to store data about the things that are represented in the database.
In relational model,
- A row is called a Tuple.
- A column is called an Attribute.
- A table is called a Relation.
- The data type of values in each column is called the Domain.
- The number of attributes in a relation is called the Degree of a relation.
- The number of rows in a relation is called the Cardinality of a relation.
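The terms above can be checked on a small relation. The sketch below uses Python's built-in sqlite3 module. The cut-down Teacher table and the second teacher's name are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A relation (table) with three attributes (columns).
conn.execute(
    "CREATE TABLE Teacher (Teacher_ID INTEGER, First_Name VARCHAR(20), Dept_No INTEGER)"
)
# Two tuples (rows); 'Alok' is a made-up sample name.
conn.executemany(
    "INSERT INTO Teacher VALUES (?, ?, ?)",
    [(101, "Shanaya", 1), (102, "Alok", 2)],
)

cursor = conn.execute("SELECT * FROM Teacher")
degree = len(cursor.description)      # number of attributes in the relation
cardinality = len(cursor.fetchall())  # number of tuples in the relation
```

Here the degree is 3 and the cardinality is 2. Adding a row changes the cardinality, while adding a column changes the degree.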
Relational Model Constraints
Constraints are limitations on the values that are stored in a database according to the specifications.
We describe below various types of constraints in Relational model –
Domain Constraint – Every value in a column must belong to that column's domain, that is, it must match the column's data type. If an incorrect value is entered, the DBMS rejects it and alerts the user that the column needs to be filled in correctly.
Key Constraint – A primary key is a column or group of columns whose values uniquely identify each row. Like a unique constraint, it does not allow two rows to share the same key value. Relationships between tables are specified using primary key and foreign key constraints.
Null Value Constraint – By default, a column may contain NULL values. The NOT NULL constraint specifies that a column must not accept NULL values. This forces the field to always contain a value, so you cannot insert a new record or update an existing record without supplying a value for this field.
Entity Integrity Constraint – The Entity Integrity Constraint states that the primary key cannot be null. Individual records in a table are identified by the primary key, and if the primary key is null, a record cannot be identified. Apart from the primary key column, any other column in the table can hold null values.
Referential Integrity Constraint – A foreign key constraint, also known as a referential constraint or a referential integrity constraint, is a logical rule governing the values in one or more columns in one or more tables: a value in a foreign key column must match a key value in the referenced table. For instance, in the Teacher table used later in these notes, every Dept_No value must match a Dept_ID in the Department table.
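A rough way to see these constraints at work is to try inserting rows that break them. The sketch below uses Python's built-in sqlite3 module with shortened Teacher and Department tables. The helper name rejected and the sample values are assumptions, and SQLite's behaviour may differ in detail from other DBMSs (for example, it enforces foreign keys only after the PRAGMA shown).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.execute("CREATE TABLE Department (Dept_ID INTEGER PRIMARY KEY, Dept_Name VARCHAR(30))")
conn.execute("""
    CREATE TABLE Teacher (
        Teacher_ID INTEGER PRIMARY KEY,
        First_Name VARCHAR(20) NOT NULL,
        Salary DECIMAL(10,2) DEFAULT 40000,
        Dept_No INTEGER CHECK (Dept_No <= 110),
        FOREIGN KEY (Dept_No) REFERENCES Department(Dept_ID)
    )""")
conn.execute("INSERT INTO Department VALUES (1, 'Hindi')")
conn.execute("INSERT INTO Teacher (Teacher_ID, First_Name, Dept_No) VALUES (101, 'Shanaya', 1)")


def rejected(sql):
    """Return True if the DBMS refuses the statement because of a constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


insert = "INSERT INTO Teacher (Teacher_ID, First_Name, Dept_No) VALUES "
results = {
    "key constraint": rejected(insert + "(101, 'Alok', 1)"),                # duplicate key
    "null value constraint": rejected(insert + "(102, NULL, 1)"),           # NOT NULL column
    "check constraint": rejected(insert + "(103, 'Alok', 200)"),            # Dept_No > 110
    "referential integrity": rejected(insert + "(104, 'Alok', 99)"),        # no such department
}
# The first teacher was inserted without a salary, so the default applies.
default_salary = conn.execute(
    "SELECT Salary FROM Teacher WHERE Teacher_ID = 101"
).fetchone()[0]
```

Each of the four invalid inserts is refused, and only the valid row for teacher 101 remains in the table.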
Data types commonly used
Structured Query Language (SQL)
Data in an RDBMS is managed using SQL. It includes two sub-languages: Data Definition Language (DDL), which is used to specify the structure of the data and the constraints on it, and Data Manipulation Language (DML), which is used to add, alter and delete data in a database.
Create a Database –
CREATE DATABASE School;
Create Table Command –
CREATE TABLE <table name> ( <column 1> <data type> [constraint], <column 2> <data type> [constraint], <column 3> <data type> [constraint] );
Question > Write a query to create a new table with the fields Teacher_ID, First_Name, Last_Name, Gender, Date_of_Birth, Salary and Dept_No.
CREATE TABLE Teacher ( Teacher_ID INTEGER, First_Name VARCHAR(20), Last_Name VARCHAR(20), Gender CHAR(1), Salary DECIMAL(10,2), Date_of_Birth DATE, Dept_No INTEGER );
Create Table using NOT NULL – The NOT NULL constraint specifies that an attribute value is not permitted to be NULL.
CREATE TABLE TEACHER ( Teacher_ID INTEGER, First_Name VARCHAR(20) NOT NULL, Last_Name VARCHAR(20), Gender CHAR(1), Salary DECIMAL(10,2), Date_of_Birth DATE, Dept_No INTEGER );
Create Table using DEFAULT – If a user has not entered a value for an attribute, the default value specified while creating the table is used.
CREATE TABLE TEACHER ( Teacher_ID INTEGER, First_Name VARCHAR(20) NOT NULL, Last_Name VARCHAR(20), Gender CHAR(1), Salary DECIMAL(10,2) DEFAULT 40000, Date_of_Birth DATE, Dept_No INTEGER );
Create a Table using CHECK – In order to restrict the values of an attribute within a range, CHECK constraint may be used.
CREATE TABLE TEACHER ( Teacher_ID INTEGER, First_Name VARCHAR(20) NOT NULL, Last_Name VARCHAR(20), Gender CHAR(1), Salary DECIMAL(10,2) DEFAULT 40000, Date_of_Birth DATE, Dept_No INTEGER CHECK (Dept_No<=110) );
Create a Table using KEY CONSTRAINT – The primary key of a table can be specified in two ways. If the primary key consists of a single attribute, that attribute can be declared PRIMARY KEY along with its description. If it consists of more than one attribute, it is declared in a separate PRIMARY KEY clause that lists those attributes.
CREATE TABLE TEACHER ( Teacher_ID INTEGER PRIMARY KEY, First_Name VARCHAR(20) NOT NULL, Last_Name VARCHAR(20), Gender CHAR(1), Salary DECIMAL(10,2) DEFAULT 40000, Date_of_Birth DATE, Dept_No INTEGER );
Create a Table using REFERENTIAL INTEGRITY CONSTRAINT – This constraint is specified by using the foreign key clause.
CREATE TABLE Teacher ( Teacher_ID INTEGER PRIMARY KEY, First_Name VARCHAR(20) NOT NULL, Last_Name VARCHAR(20), Gender CHAR(1), Salary DECIMAL(10,2) DEFAULT 40000, Date_of_Birth DATE, Dept_No INTEGER, FOREIGN KEY (Dept_No) REFERENCES Department(Dept_ID) );
Naming of Constraint
In the Create Table command, constraints can be named. The benefit is that using the Alter Table command, specified restrictions can be quickly altered or deleted. When naming a constraint, use the keyword CONSTRAINT followed by the constraint’s name and its specification.
For example consider the following Create Table command –
CREATE TABLE Teacher ( Teacher_ID INTEGER, First_Name VARCHAR(20) NOT NULL, Last_Name VARCHAR(20), Gender CHAR(1), Salary DECIMAL(10,2) DEFAULT 40000, Date_of_Birth DATE, Dept_No INTEGER, CONSTRAINT TEACHER_PK PRIMARY KEY (Teacher_ID), CONSTRAINT TEACHER_FK FOREIGN KEY (Dept_No) REFERENCES Department(Dept_ID) ON DELETE SET NULL ON UPDATE SET NULL );
In the above table, the primary key constraint is named as TEACHER_PK and the foreign key constraint is named as TEACHER_FK.
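To see what the named foreign key with ON DELETE SET NULL does in practice, here is a small sketch using Python's built-in sqlite3 module. The Department table definition and the sample rows are assumptions, and the Teacher table is shortened to the columns that matter here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # needed for SQLite to enforce TEACHER_FK
conn.execute("CREATE TABLE Department (Dept_ID INTEGER PRIMARY KEY, Dept_Name VARCHAR(30))")
conn.execute("""
    CREATE TABLE Teacher (
        Teacher_ID INTEGER,
        First_Name VARCHAR(20) NOT NULL,
        Dept_No INTEGER,
        CONSTRAINT TEACHER_PK PRIMARY KEY (Teacher_ID),
        CONSTRAINT TEACHER_FK FOREIGN KEY (Dept_No)
            REFERENCES Department(Dept_ID)
            ON DELETE SET NULL ON UPDATE SET NULL
    )""")
conn.execute("INSERT INTO Department VALUES (1, 'Hindi')")
conn.execute("INSERT INTO Teacher VALUES (101, 'Shanaya', 1)")

dept_before = conn.execute(
    "SELECT Dept_No FROM Teacher WHERE Teacher_ID = 101"
).fetchone()[0]

# Deleting the referenced department does not delete the teacher;
# the teacher's Dept_No is set to NULL instead.
conn.execute("DELETE FROM Department WHERE Dept_ID = 1")

dept_after = conn.execute(
    "SELECT Dept_No FROM Teacher WHERE Teacher_ID = 101"
).fetchone()[0]
```

After the delete, the teacher row is still present but its Dept_No is NULL (None in Python), which is the behaviour requested by ON DELETE SET NULL.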
Drop Table Command
This command is used to delete tables. For example, suppose you want to drop the Teacher table then the command would be:
DROP TABLE Teacher CASCADE;
The Teacher table is dropped and, because of the CASCADE option, all the constraints that refer to this table are also dropped automatically.
However if the requirement is that the table should not be dropped if it is being referenced in some other table then RESTRICT option can be used as shown below:
DROP TABLE Teacher RESTRICT;
Alter Table Command
Adding a column – Suppose we want to add a column Age in the Teacher table. Following command is used to add the column –
ALTER TABLE Teacher ADD Age INTEGER;
Dropping a column – A column can be dropped using this command, but one must specify the option (RESTRICT or CASCADE) for the drop behaviour. RESTRICT does not let the column be dropped if it is being referenced in other tables. CASCADE drops the constraints associated with this column in this relation as well as all the constraints that refer to this column.
ALTER TABLE Teacher DROP Dept_No CASCADE;
Dropping keys – A foreign key/primary key/key can be dropped by using ALTER TABLE command.
ALTER TABLE Teacher DROP FOREIGN KEY TEACHER_FK;
Adding a Constraint – If you want to add the foreign key constraint TEACHER_FK back, then the command would be –
ALTER TABLE Teacher ADD CONSTRAINT TEACHER_FK FOREIGN KEY (Dept_No) REFERENCES Department(Dept_ID) ON DELETE SET NULL ON UPDATE SET NULL;
The INSERT command is used to insert a tuple in a relation. We must specify the name of the relation in which the tuple is to be inserted and the values. The values must be in the same order as the columns specified in the Create Table command. For example, consider the following table Teacher:
CREATE TABLE Teacher ( Teacher_ID INTEGER, First_Name VARCHAR(20) NOT NULL, Last_Name VARCHAR(20), Gender CHAR(1), Salary DECIMAL(10,2) DEFAULT 40000, Date_of_Birth DATE, Dept_No INTEGER, CONSTRAINT TEACHER_PK PRIMARY KEY (Teacher_ID) );
To insert a tuple in the Teacher table, the INSERT command can be used as shown below:
INSERT INTO Teacher VALUES (101, 'Shanaya', 'Batra', 'F', 50000, '1984-08-11', 1);
The UPDATE command is used to update the attribute values of one or more tuples in a table.
UPDATE Teacher SET Salary=55000 WHERE Teacher_ID=101;
In order to delete one or more tuples, DELETE command is used.
DELETE FROM Teacher WHERE Teacher_ID=101;
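The insert, update and delete commands above can be run end to end as follows. This is a sketch using Python's built-in sqlite3 module. The `?` placeholders are sqlite3's way of passing values and are not part of the original notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Teacher (
        Teacher_ID INTEGER,
        First_Name VARCHAR(20) NOT NULL,
        Last_Name VARCHAR(20),
        Gender CHAR(1),
        Salary DECIMAL(10,2) DEFAULT 40000,
        Date_of_Birth DATE,
        Dept_No INTEGER,
        CONSTRAINT TEACHER_PK PRIMARY KEY (Teacher_ID)
    )""")

# INSERT: values are given in the same order as the columns were declared.
conn.execute(
    "INSERT INTO Teacher VALUES (?, ?, ?, ?, ?, ?, ?)",
    (101, "Shanaya", "Batra", "F", 50000, "1984-08-11", 1),
)
salary_inserted = conn.execute(
    "SELECT Salary FROM Teacher WHERE Teacher_ID = 101"
).fetchone()[0]

# UPDATE: change one attribute of the matching tuple.
conn.execute("UPDATE Teacher SET Salary = 55000 WHERE Teacher_ID = 101")
salary_updated = conn.execute(
    "SELECT Salary FROM Teacher WHERE Teacher_ID = 101"
).fetchone()[0]

# DELETE: remove the matching tuple.
conn.execute("DELETE FROM Teacher WHERE Teacher_ID = 101")
rows_left = conn.execute("SELECT COUNT(*) FROM Teacher").fetchone()[0]
```

Note that UPDATE and DELETE without a WHERE clause would affect every tuple in the table, so the condition matters.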
The SELECT Command is used to retrieve information from a database.
SELECT <attribute list> FROM <table list> WHERE <condition>
1. Query – To retrieve all the information about Teacher with ID=101. In this query we have to specify all the attributes in the SELECT clause. An easier way to do this is to use asterisk (*), which means all the attributes.
SELECT * FROM Teacher WHERE Teacher_ID=101;
2. Query – To find the names of all teachers earning more than 50000.
SELECT First_Name,Last_Name FROM Teacher WHERE salary > 50000;
3. Query – To display the Teacher_ID, First_Name, Last_Name and Dept_No of teachers who belong to department number 4 or 7.
SELECT Teacher_ID,First_Name,Last_Name, Dept_No FROM Teacher WHERE Dept_No = 4 OR Dept_No = 7;
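4. Query – To retrieve the names of all the teachers and the names and numbers of their respective departments. Note that this query requires two tables, Teacher and Department, and a condition that matches each teacher with their own department. Without that condition, every teacher would be paired with every department. Consider the following query:
SELECT First_Name, Last_Name, Dept_ID, Dept_Name FROM Teacher, Department WHERE Teacher.Dept_No = Department.Dept_ID;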
5. Query – To retrieve the names of all the teachers who belong to the Hindi department.
SELECT First_Name, Last_Name FROM Teacher, Department WHERE Department.Dept_ID = Teacher.Dept_No AND Dept_Name = 'Hindi';
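The difference a join condition makes in queries 4 and 5 can be seen on a tiny data set. The sketch below uses Python's built-in sqlite3 module. The shortened tables and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Department (Dept_ID INTEGER PRIMARY KEY, Dept_Name VARCHAR(30))")
conn.execute(
    "CREATE TABLE Teacher (Teacher_ID INTEGER PRIMARY KEY, First_Name VARCHAR(20), Dept_No INTEGER)"
)
conn.executemany(
    "INSERT INTO Department VALUES (?, ?)", [(1, "Hindi"), (2, "Computer Science")]
)
conn.executemany(
    "INSERT INTO Teacher VALUES (?, ?, ?)", [(101, "Shanaya", 1), (102, "Alok", 2)]
)

# No join condition: every teacher is paired with every department.
product = conn.execute("SELECT First_Name, Dept_Name FROM Teacher, Department").fetchall()

# Join condition: each teacher is paired only with their own department.
joined = conn.execute(
    "SELECT First_Name, Dept_Name FROM Teacher, Department "
    "WHERE Teacher.Dept_No = Department.Dept_ID ORDER BY Teacher_ID"
).fetchall()

# Join condition plus a filter on the department name.
hindi = conn.execute(
    "SELECT First_Name FROM Teacher, Department "
    "WHERE Department.Dept_ID = Teacher.Dept_No AND Dept_Name = 'Hindi'"
).fetchall()
```

With two teachers and two departments, the query without a condition returns four rows, while the joined query returns one row per teacher.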
6. Query – To retrieve the names of all the teachers whose first name starts with the letter 'S'. In a LIKE pattern, % matches any sequence of characters and _ matches exactly one character.
SELECT First_Name FROM Teacher WHERE First_Name LIKE 'S%';
7. Query – To retrieve the names of all the teachers having 6 characters in the first name and starting with 'S'.
SELECT First_Name FROM Teacher WHERE First_Name LIKE 'S_____';
8. Query – To retrieve the names of all the teachers having at least 6 characters in the first name.
SELECT First_Name FROM Teacher WHERE First_Name LIKE '______%';
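The three LIKE patterns can be compared side by side on a few sample names. This sketch uses Python's built-in sqlite3 module. All names except Shanaya are made up, and note that SQLite's LIKE ignores case for ASCII letters, which may differ from other DBMSs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Teacher (First_Name VARCHAR(20))")
conn.executemany(
    "INSERT INTO Teacher VALUES (?)",
    [("Shanaya",), ("Surbhi",), ("Sam",), ("Rajesh",), ("Alok",)],
)


def names_like(pattern):
    """Return the first names matching a LIKE pattern, in alphabetical order."""
    rows = conn.execute(
        "SELECT First_Name FROM Teacher WHERE First_Name LIKE ? ORDER BY First_Name",
        (pattern,),
    )
    return [name for (name,) in rows]


starts_with_s = names_like("S%")       # 'S' followed by anything
six_chars_s = names_like("S_____")     # 'S' followed by exactly five characters
at_least_six = names_like("______%")   # six single characters, then anything
```

The underscores must be written without spaces between them, because a space in the pattern would only match a space in the name.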
9. Query – To list the names of teachers in alphabetical order.
SELECT First_Name, Last_Name FROM Teacher ORDER BY First_Name, Last_Name;
10. Query – To list the names of all the Departments in the descending order of their names.
SELECT Dept_Name FROM Department ORDER BY Dept_Name DESC;
11. Query – To retrieve the names and department numbers of all the teachers ordered by the Department number and within each department ordered by the names of the teachers in descending order.
SELECT First_Name, Last_Name, Dept_No FROM Teacher ORDER BY Dept_No ASC, First_Name DESC, Last_Name DESC;
12. Query – To retrieve all the details of those teachers whose last name is not specified.
SELECT * FROM Teacher WHERE Last_Name IS NULL;
13. Query – To find total salary of all the teachers .
SELECT SUM(Salary) AS Total_Salary FROM Teacher;
14. Query – To find the maximum and minimum salary.
SELECT MAX(Salary) AS Max_Salary, MIN(Salary) AS Min_Salary FROM Teacher;
15. Query – To count the number of teachers earning more than Rs 40000.
SELECT COUNT(Salary) FROM Teacher WHERE Salary > 40000;
16. Query – To retrieve the number of teachers in the "Computer Science" department.
SELECT COUNT(*) AS No_of_Computer_Science_Teachers FROM Department, Teacher WHERE Dept_Name = 'Computer Science' AND Dept_No = Dept_ID;
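The aggregate queries above can be tried on a small data set. This sketch uses Python's built-in sqlite3 module. The three sample teachers and their salaries are assumptions chosen so the results are easy to check by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Department (Dept_ID INTEGER PRIMARY KEY, Dept_Name VARCHAR(30))")
conn.execute(
    "CREATE TABLE Teacher (Teacher_ID INTEGER PRIMARY KEY, First_Name VARCHAR(20), "
    "Salary DECIMAL(10,2), Dept_No INTEGER)"
)
conn.executemany(
    "INSERT INTO Department VALUES (?, ?)", [(1, "Hindi"), (2, "Computer Science")]
)
conn.executemany(
    "INSERT INTO Teacher VALUES (?, ?, ?, ?)",
    [(101, "Shanaya", 50000, 1), (102, "Alok", 40000, 2), (103, "Surbhi", 60000, 2)],
)

total_salary = conn.execute("SELECT SUM(Salary) AS Total_Salary FROM Teacher").fetchone()[0]
max_salary, min_salary = conn.execute(
    "SELECT MAX(Salary) AS Max_Salary, MIN(Salary) AS Min_Salary FROM Teacher"
).fetchone()
# Strictly greater than 40000, so the teacher earning exactly 40000 is not counted.
well_paid = conn.execute(
    "SELECT COUNT(Salary) FROM Teacher WHERE Salary > 40000"
).fetchone()[0]
cs_teachers = conn.execute(
    "SELECT COUNT(*) AS No_of_Computer_Science_Teachers FROM Department, Teacher "
    "WHERE Dept_Name = 'Computer Science' AND Dept_No = Dept_ID"
).fetchone()[0]
```

Each aggregate function collapses many rows into a single value, which is why fetchone() is enough to read the result.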
Database design basics
A properly designed database provides you with access to up-to-date, accurate information. Because a correct design is essential to achieving your goals in working with a database, investing the time required to learn the principles of good design makes sense. In the end, you are much more likely to end up with a database that meets your needs and can easily accommodate change.
This article provides guidelines for planning a desktop database. You will learn how to decide what information you need, how to divide that information into the appropriate tables and columns, and how those tables relate to each other. You should read this article before you create your first desktop database.
In this article
- Some database terms to know
- What is good database design
- The design process
- Determining the purpose of your database
- Finding and organizing the required information
- Dividing the information into tables
- Turning information items into columns
- Specifying primary keys
- Creating the table relationships
- Refining the design
- Applying the normalization rules
Access organizes your information into tables : lists of rows and columns reminiscent of an accountant’s pad or a spreadsheet. In a simple database, you might have only one table. For most databases you will need more than one. For example, you might have a table that stores information about products, another table that stores information about orders, and another table with information about customers.
Each row is more correctly called a record , and each column, a field . A record is a meaningful and consistent way to combine information about something. A field is a single item of information — an item type that appears in every record. In the Products table, for instance, each row or record would hold information about one product. Each column or field holds some type of information about that product, such as its name or price.
Certain principles guide the database design process. The first principle is that duplicate information (also called redundant data) is bad, because it wastes space and increases the likelihood of errors and inconsistencies. The second principle is that the correctness and completeness of information is important. If your database contains incorrect information, any reports that pull information from the database will also contain incorrect information. As a result, any decisions you make that are based on those reports will then be misinformed.
A good database design is, therefore, one that:
Divides your information into subject-based tables to reduce redundant data.
Provides Access with the information it requires to join the information in the tables together as needed.
Helps support and ensure the accuracy and integrity of your information.
Accommodates your data processing and reporting needs.
The design process consists of the following steps:
Determine the purpose of your database
This helps prepare you for the remaining steps.
Find and organize the information required
Gather all of the types of information you might want to record in the database, such as product name and order number.
Divide the information into tables
Divide your information items into major entities or subjects, such as Products or Orders. Each subject then becomes a table.
Turn information items into columns
Decide what information you want to store in each table. Each item becomes a field, and is displayed as a column in the table. For example, an Employees table might include fields such as Last Name and Hire Date.
Specify primary keys
Choose each table’s primary key. The primary key is a column that is used to uniquely identify each row. An example might be Product ID or Order ID.
Set up the table relationships
Look at each table and decide how the data in one table is related to the data in other tables. Add fields to tables or create new tables to clarify the relationships, as necessary.
Refine your design
Analyze your design for errors. Create the tables and add a few records of sample data. See if you can get the results you want from your tables. Make adjustments to the design, as needed.
Apply the normalization rules
Apply the data normalization rules to see if your tables are structured correctly. Make adjustments to the tables, as needed.
It is a good idea to write down the purpose of the database on paper — its purpose, how you expect to use it, and who will use it. For a small database for a home based business, for example, you might write something simple like "The customer database keeps a list of customer information for the purpose of producing mailings and reports." If the database is more complex or is used by many people, as often occurs in a corporate setting, the purpose could easily be a paragraph or more and should include when and how each person will use the database. The idea is to have a well developed mission statement that can be referred to throughout the design process. Having such a statement helps you focus on your goals when you make decisions.
To find and organize the information required, start with your existing information. For example, you might record purchase orders in a ledger or keep customer information on paper forms in a file cabinet. Gather those documents and list each type of information shown (for example, each box that you fill in on a form). If you don't have any existing forms, imagine instead that you have to design a form to record the customer information. What information would you put on the form? What fill-in boxes would you create? Identify and list each of these items. For example, suppose you currently keep the customer list on index cards. Examining these cards might show that each card holds a customer's name, address, city, state, postal code and telephone number. Each of these items represents a potential column in a table.
As you prepare this list, don’t worry about getting it perfect at first. Instead, list each item that comes to mind. If someone else will be using the database, ask for their ideas, too. You can fine-tune the list later.
Next, consider the types of reports or mailings you might want to produce from the database. For instance, you might want a product sales report to show sales by region, or an inventory summary report that shows product inventory levels. You might also want to generate form letters to send to customers that announce a sale event or offer a premium. Design the report in your mind, and imagine what it would look like. What information would you place on the report? List each item. Do the same for the form letter and for any other report you anticipate creating.
Giving thought to the reports and mailings you might want to create helps you identify items you will need in your database. For example, suppose you give customers the opportunity to opt in to (or out of) periodic e-mail updates, and you want to print a listing of those who have opted in. To record that information, you add a “Send e-mail” column to the customer table. For each customer, you can set the field to Yes or No.
The requirement to send e-mail messages to customers suggests another item to record. Once you know that a customer wants to receive e-mail messages, you will also need to know the e-mail address to which to send them. Therefore you need to record an e-mail address for each customer.
It makes good sense to construct a prototype of each report or output listing and consider what items you will need to produce the report. For instance, when you examine a form letter, a few things might come to mind. If you want to include a proper salutation (for example, the "Mr.", "Mrs." or "Ms." string that starts a greeting), you will have to create a salutation item. Also, you might typically start a letter with “Dear Mr. Smith”, rather than “Dear Mr. Sylvester Smith”. This suggests you would typically want to store the last name separate from the first name.
A key point to remember is that you should break each piece of information into its smallest useful parts. In the case of a name, to make the last name readily available, you will break the name into two parts — First Name and Last Name. To sort a report by last name, for example, it helps to have the customer's last name stored separately. In general, if you want to sort, search, calculate, or report based on an item of information, you should put that item in its own field.
Think about the questions you might want the database to answer. For instance, how many sales of your featured product did you close last month? Where do your best customers live? Who is the supplier for your best-selling product? Anticipating these questions helps you zero in on additional items to record.
After gathering this information, you are ready for the next step.
To divide the information into tables, choose the major entities, or subjects. For example, after finding and organizing information for a product sales database, the preliminary list might look like this:
The major entities shown here are the products, the suppliers, the customers, and the orders. Therefore, it makes sense to start out with these four tables: one for facts about products, one for facts about suppliers, one for facts about customers, and one for facts about orders. Although this doesn’t complete the list, it is a good starting point. You can continue to refine this list until you have a design that works well.
When you first review the preliminary list of items, you might be tempted to place them all in a single table, instead of the four shown in the preceding illustration. You will learn here why that is a bad idea. Consider for a moment, the table shown here:
In this case, each row contains information about both the product and its supplier. Because you can have many products from the same supplier, the supplier name and address information has to be repeated many times. This wastes disk space. Recording the supplier information only once in a separate Suppliers table, and then linking that table to the Products table, is a much better solution.
A second problem with this design comes about when you need to modify information about the supplier. For example, suppose you need to change a supplier's address. Because it appears in many places, you might accidentally change the address in one place but forget to change it in the others. Recording the supplier’s address in only one place solves the problem.
When you design your database, always try to record each fact just once. If you find yourself repeating the same information in more than one place, such as the address for a particular supplier, place that information in a separate table.
Finally, suppose there is only one product supplied by Coho Winery, and you want to delete the product, but retain the supplier name and address information. How would you delete the product record without also losing the supplier information? You can't. Because each record contains facts about a product, as well as facts about a supplier, you cannot delete one without deleting the other. To keep these facts separate, you must split the one table into two: one table for product information, and another table for supplier information. Deleting a product record should delete only the facts about the product, not the facts about the supplier.
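The deletion problem described above goes away once products and suppliers live in separate tables. The sketch below uses Python's built-in sqlite3 module rather than Access. The table layouts, the product name and the address are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
# Facts about suppliers are recorded once, in their own table.
conn.execute(
    "CREATE TABLE Suppliers (SupplierID INTEGER PRIMARY KEY, Name TEXT, Address TEXT)"
)
# Each product row only points at its supplier instead of repeating the address.
conn.execute(
    "CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, Name TEXT, "
    "SupplierID INTEGER REFERENCES Suppliers(SupplierID))"
)
conn.execute("INSERT INTO Suppliers VALUES (1, 'Coho Winery', '12 Example Road')")
conn.execute("INSERT INTO Products VALUES (1, 'Chardonnay', 1)")

# Delete the only product supplied by Coho Winery.
conn.execute("DELETE FROM Products WHERE ProductID = 1")

products_left = conn.execute("SELECT COUNT(*) FROM Products").fetchone()[0]
suppliers_left = conn.execute("SELECT Name, Address FROM Suppliers").fetchall()
```

Deleting the product removes only the facts about the product, and the supplier's name and address are still on record.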
Once you have chosen the subject that is represented by a table, columns in that table should store facts only about the subject. For instance, the product table should store facts only about products. Because the supplier address is a fact about the supplier, and not a fact about the product, it belongs in the supplier table.
To determine the columns in a table, decide what information you need to track about the subject recorded in the table. For example, for the Customers table, Name, Address, City-State-Zip, Send e-mail, Salutation and E-mail address comprise a good starting list of columns. Each record in the table contains the same set of columns, so you can store Name, Address, City-State-Zip, Send e-mail, Salutation and E-mail address information for each record. For example, the address column contains customers’ addresses. Each record contains data about one customer, and the address field contains the address for that customer.
Once you have determined the initial set of columns for each table, you can further refine the columns. For example, it makes sense to store the customer name as two separate columns: first name and last name, so that you can sort, search, and index on just those columns. Similarly, the address actually consists of five separate components, address, city, state, postal code, and country/region, and it also makes sense to store them in separate columns. If you want to perform a search, filter or sort operation by state, for example, you need the state information stored in a separate column.
You should also consider whether the database will hold information that is of domestic origin only, or international, as well. For instance, if you plan to store international addresses, it is better to have a Region column instead of State, because such a column can accommodate both domestic states and the regions of other countries/regions. Similarly, Postal Code makes more sense than Zip Code if you are going to store international addresses.
The following list shows a few tips for determining your columns.
Don’t include calculated data
In most cases, you should not store the result of calculations in tables. Instead, you can have Access perform the calculations when you want to see the result. For example, suppose there is a Products On Order report that displays the subtotal of units on order for each category of product in the database. However, there is no Units On Order subtotal column in any table. Instead, the Products table includes a Units On Order column that stores the units on order for each product. Using that data, Access calculates the subtotal each time you print the report. The subtotal itself should not be stored in a table.
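The same idea can be sketched outside Access: store only the per-product figure and let a query work out the subtotal whenever it is needed. The example uses Python's built-in sqlite3 module, and the column names, categories and quantities are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Only the units on order for each product are stored; no subtotal column exists.
conn.execute(
    "CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, Name TEXT, "
    "Category TEXT, UnitsOnOrder INTEGER)"
)
conn.executemany(
    "INSERT INTO Products VALUES (?, ?, ?, ?)",
    [
        (1, "Chai", "Beverages", 10),
        (2, "Coffee", "Beverages", 5),
        (3, "Jam", "Condiments", 7),
    ],
)

# The subtotal per category is calculated each time the query runs.
subtotals = conn.execute(
    "SELECT Category, SUM(UnitsOnOrder) FROM Products "
    "GROUP BY Category ORDER BY Category"
).fetchall()
```

Because the subtotal is derived on demand, it cannot drift out of step with the underlying product rows.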
Store information in its smallest logical parts
You may be tempted to have a single field for full names, or for product names along with product descriptions. If you combine more than one kind of information in a field, it is difficult to retrieve individual facts later. Try to break down information into logical parts; for example, create separate fields for first and last name, or for product name, category, and description.
Once you have refined the data columns in each table, you are ready to choose each table's primary key.
Each table should include a column or set of columns that uniquely identifies each row stored in the table. This is often a unique identification number, such as an employee ID number or a serial number. In database terminology, this information is called the primary key of the table. Access uses primary key fields to quickly associate data from multiple tables and bring the data together for you.
If you already have a unique identifier for a table, such as a product number that uniquely identifies each product in your catalog, you can use that identifier as the table’s primary key — but only if the values in this column will always be different for each record. You cannot have duplicate values in a primary key. For example, don’t use people’s names as a primary key, because names are not unique. You could easily have two people with the same name in the same table.
A primary key must always have a value. If a column's value can become unassigned or unknown (a missing value) at some point, it can't be used as a component in a primary key.
You should always choose a primary key whose value will not change. In a database that uses more than one table, a table’s primary key can be used as a reference in other tables. If the primary key changes, the change must also be applied everywhere the key is referenced. Using a primary key that will not change reduces the chance that the primary key might become out of sync with other tables that reference it.
Often, an arbitrary unique number is used as the primary key. For example, you might assign each order a unique order number. The order number's only purpose is to identify an order. Once assigned, it never changes.
If you don’t have in mind a column or set of columns that might make a good primary key, consider using a column that has the AutoNumber data type. When you use the AutoNumber data type, Access automatically assigns a value for you. Such an identifier is factless; it contains no factual information describing the row that it represents. Factless identifiers are ideal for use as a primary key because they do not change. A primary key that contains facts about a row — a telephone number or a customer name, for example — is more likely to change, because the factual information itself might change.
1. A column set to the AutoNumber data type often makes a good primary key. No two product IDs are the same.
In some cases, you may want to use two or more fields that, together, provide the primary key of a table. For example, an Order Details table that stores line items for orders would use two columns in its primary key: Order ID and Product ID. When a primary key employs more than one column, it is also called a composite key.
For the product sales database, you can create an AutoNumber column for each of the tables to serve as primary key: ProductID for the Products table, OrderID for the Orders table, CustomerID for the Customers table, and SupplierID for the Suppliers table.
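As a rough analogue of an AutoNumber primary key, the sketch below uses SQLite's auto-assigned integer key through Python's sqlite3 module. It is not Access itself, and the ProductName column and sample values are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# INTEGER PRIMARY KEY AUTOINCREMENT plays the role of an AutoNumber column here.
conn.execute(
    "CREATE TABLE Products ("
    " ProductID INTEGER PRIMARY KEY AUTOINCREMENT,"
    " ProductName TEXT NOT NULL)"
)

# No ProductID is supplied; the database assigns a factless identifier.
first_id = conn.execute(
    "INSERT INTO Products (ProductName) VALUES (?)", ("Chai",)
).lastrowid
second_id = conn.execute(
    "INSERT INTO Products (ProductName) VALUES (?)", ("Chang",)
).lastrowid

# A primary key cannot contain duplicate values, so reusing an ID is rejected.
try:
    conn.execute(
        "INSERT INTO Products (ProductID, ProductName) VALUES (?, ?)",
        (first_id, "Duplicate"),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```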
Now that you have divided your information into tables, you need a way to bring the information together again in meaningful ways. For example, the following form includes information from several tables.
1. Information in this form comes from the Customers table...
2. ...the Employees table...
3. ...the Orders table...
4. ...the Products table...
5. ...and the Order Details table.
Access is a relational database management system. In a relational database, you divide your information into separate, subject-based tables. You then use table relationships to bring the information together as needed.
Creating a one-to-many relationship
Consider this example: the Suppliers and Products tables in the product orders database. A supplier can supply any number of products. It follows that for any supplier represented in the Suppliers table, there can be many products represented in the Products table. The relationship between the Suppliers table and the Products table is, therefore, a one-to-many relationship.
To represent a one-to-many relationship in your database design, take the primary key on the "one" side of the relationship and add it as an additional column or columns to the table on the "many" side of the relationship. In this case, for example, you add the Supplier ID column from the Suppliers table to the Products table. Access can then use the supplier ID number in the Products table to locate the correct supplier for each product.
The Supplier ID column in the Products table is called a foreign key. A foreign key is another table’s primary key. The Supplier ID column in the Products table is a foreign key because it is also the primary key in the Suppliers table.
You provide the basis for joining related tables by establishing pairings of primary keys and foreign keys. If you are not sure which tables should share a common column, identifying a one-to-many relationship ensures that the two tables involved will, indeed, require a shared column.
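The pairing of a primary key with a foreign key can be sketched as follows. This is a minimal example in Python's sqlite3 module, not Access; the CompanyName and ProductName columns and all sample rows are assumed for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE Suppliers (
    SupplierID INTEGER PRIMARY KEY,
    CompanyName TEXT NOT NULL
);
CREATE TABLE Products (
    ProductID INTEGER PRIMARY KEY,
    ProductName TEXT NOT NULL,
    -- Foreign key: the primary key of the "one" side, added to the "many" side.
    SupplierID INTEGER NOT NULL REFERENCES Suppliers(SupplierID)
);
INSERT INTO Suppliers VALUES (1, 'Supplier A'), (2, 'Supplier B');
INSERT INTO Products VALUES (1, 'Chai', 1), (2, 'Chang', 1), (3, 'Ikura', 2);
""")

# Use the shared column to locate the correct supplier for each product.
rows = conn.execute(
    "SELECT s.CompanyName, p.ProductName FROM Suppliers AS s"
    " JOIN Products AS p ON p.SupplierID = s.SupplierID"
    " ORDER BY s.SupplierID, p.ProductID"
).fetchall()

# A product that points at a supplier that does not exist is rejected.
try:
    conn.execute("INSERT INTO Products VALUES (4, 'Orphan', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```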
Creating a many-to-many relationship
Consider the relationship between the Products table and Orders table.
A single order can include more than one product. On the other hand, a single product can appear on many orders. Therefore, for each record in the Orders table, there can be many records in the Products table. And for each record in the Products table, there can be many records in the Orders table. This type of relationship is called a many-to-many relationship because for any product, there can be many orders; and for any order, there can be many products. Note that to detect many-to-many relationships between your tables, it is important that you consider both sides of the relationship.
The subjects of the two tables — orders and products — have a many-to-many relationship. This presents a problem. To understand the problem, imagine what would happen if you tried to create the relationship between the two tables by adding the Product ID field to the Orders table. To have more than one product per order, you need more than one record in the Orders table per order. You would be repeating order information for each row that relates to a single order — resulting in an inefficient design that could lead to inaccurate data. You run into the same problem if you put the Order ID field in the Products table — you would have more than one record in the Products table for each product. How do you solve this problem?
The answer is to create a third table, often called a junction table, that breaks down the many-to-many relationship into two one-to-many relationships. You insert the primary key from each of the two tables into the third table. As a result, the third table records each occurrence or instance of the relationship.
Each record in the Order Details table represents one line item on an order. The Order Details table’s primary key consists of two fields — the foreign keys from the Orders and the Products tables. Using the Order ID field alone doesn’t work as the primary key for this table, because one order can have many line items. The Order ID is repeated for each line item on an order, so the field doesn’t contain unique values. Using the Product ID field alone doesn’t work either, because one product can appear on many different orders. But together, the two fields always produce a unique value for each record.
In the product sales database, the Orders table and the Products table are not related to each other directly. Instead, they are related indirectly through the Order Details table. The many-to-many relationship between orders and products is represented in the database by using two one-to-many relationships:
The Orders table and Order Details table have a one-to-many relationship. Each order can have more than one line item, but each line item is connected to only one order.
The Products table and Order Details table have a one-to-many relationship. Each product can have many line items associated with it, but each line item refers to only one product.
From the Order Details table, you can determine all of the products on a particular order. You can also determine all of the orders for a particular product.
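A junction table with a composite primary key might look like the sketch below, written with Python's sqlite3 module rather than Access. The Quantity and OrderDate columns and the sample rows are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, OrderDate TEXT);
CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, ProductName TEXT NOT NULL);
-- Junction table: one row per line item, keyed by both foreign keys together.
CREATE TABLE OrderDetails (
    OrderID INTEGER NOT NULL REFERENCES Orders(OrderID),
    ProductID INTEGER NOT NULL REFERENCES Products(ProductID),
    Quantity INTEGER NOT NULL,
    PRIMARY KEY (OrderID, ProductID)
);
INSERT INTO Orders VALUES (1, '2024-01-05'), (2, '2024-01-06');
INSERT INTO Products VALUES (10, 'Chai'), (11, 'Chang');
INSERT INTO OrderDetails VALUES (1, 10, 2), (1, 11, 1), (2, 10, 5);
""")


def products_on_order(order_id):
    """All products that appear on one order."""
    rows = conn.execute(
        "SELECT p.ProductName FROM OrderDetails AS d"
        " JOIN Products AS p ON p.ProductID = d.ProductID"
        " WHERE d.OrderID = ? ORDER BY p.ProductID",
        (order_id,),
    )
    return [name for (name,) in rows]


def orders_for_product(product_id):
    """All orders on which one product appears."""
    rows = conn.execute(
        "SELECT OrderID FROM OrderDetails WHERE ProductID = ? ORDER BY OrderID",
        (product_id,),
    )
    return [order_id for (order_id,) in rows]


chai_orders = orders_for_product(10)
first_order_products = products_on_order(1)
```

Neither OrderID nor ProductID is unique on its own in OrderDetails, but the pair is, which is why the two columns together form the key.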
After incorporating the Order Details table, the list of tables and fields might look something like this:
Creating a one-to-one relationship
Another type of relationship is the one-to-one relationship. For instance, suppose you need to record some special supplementary product information that you will need rarely or that only applies to a few products. Because you don't need the information often, and because storing the information in the Products table would result in empty space for every product to which it doesn’t apply, you place it in a separate table. Like the Products table, you use the ProductID as the primary key. The relationship between this supplemental table and the Products table is a one-to-one relationship. For each record in the Products table, there exists a single matching record in the supplemental table. When you do identify such a relationship, both tables must share a common field.
When you detect the need for a one-to-one relationship in your database, consider whether you can put the information from the two tables together in one table. If you don’t want to do that for some reason, perhaps because it would result in a lot of empty space, the following list shows how you would represent the relationship in your design:
If the two tables have the same subject, you can probably set up the relationship by using the same primary key in both tables.
If the two tables have different subjects with different primary keys, choose one of the tables (either one) and insert its primary key in the other table as a foreign key.
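The first option, sharing the same primary key in both tables, can be sketched like this with Python's sqlite3 module. The supplemental table name (ProductSupplements), its HandlingNotes column, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Products (
    ProductID INTEGER PRIMARY KEY,
    ProductName TEXT NOT NULL
);
-- Hypothetical supplemental table: it reuses ProductID as its own primary key,
-- so each product can have at most one supplemental row.
CREATE TABLE ProductSupplements (
    ProductID INTEGER PRIMARY KEY REFERENCES Products(ProductID),
    HandlingNotes TEXT NOT NULL
);
INSERT INTO Products VALUES (1, 'Chai'), (2, 'Ikura');
INSERT INTO ProductSupplements VALUES (2, 'Keep refrigerated');
""")

# Products without supplemental information simply have no matching row.
rows = conn.execute(
    "SELECT p.ProductName, s.HandlingNotes FROM Products AS p"
    " LEFT JOIN ProductSupplements AS s ON s.ProductID = p.ProductID"
    " ORDER BY p.ProductID"
).fetchall()

# A second supplemental row for the same product violates the shared key.
try:
    conn.execute("INSERT INTO ProductSupplements VALUES (2, 'Second note')")
    second_row_rejected = False
except sqlite3.IntegrityError:
    second_row_rejected = True
```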
Determining the relationships between tables helps you ensure that you have the right tables and columns. When a one-to-one or one-to-many relationship exists, the tables involved need to share a common column or columns. When a many-to-many relationship exists, a third table is needed to represent the relationship.
Once you have the tables, fields, and relationships you need, you should create and populate your tables with sample data and try working with the information: creating queries, adding new records, and so on. Doing this helps highlight potential problems — for example, you might need to add a column that you forgot to insert during your design phase, or you may have a table that you should split into two tables to remove duplication.
See if you can use the database to get the answers you want. Create rough drafts of your forms and reports and see if they show the data you expect. Look for unnecessary duplication of data and, when you find any, alter your design to eliminate it.
As you try out your initial database, you will probably discover room for improvement. Here are a few things to check for:
Did you forget any columns? If so, does the information belong in the existing tables? If it is information about something else, you may need to create another table. Create a column for every information item you need to track. If the information can’t be calculated from other columns, it is likely that you will need a new column for it.
Are any columns unnecessary because they can be calculated from existing fields? If an information item can be calculated from other existing columns (a discounted price calculated from the retail price, for example), it is usually better to do just that and avoid creating a new column.
Are you repeatedly entering duplicate information in one of your tables? If so, you probably need to divide the table into two tables that have a one-to-many relationship.
Do you have tables with many fields, a limited number of records, and many empty fields in individual records? If so, think about redesigning the table so it has fewer fields and more records.
Has each information item been broken into its smallest useful parts? If you need to report, sort, search, or calculate on an item of information, put that item in its own column.
Does each column contain a fact about the table's subject? If a column does not contain information about the table's subject, it belongs in a different table.
Are all relationships between tables represented, either by common fields or by a third table? One-to-one and one-to-many relationships require common columns. Many-to-many relationships require a third table.
Refining the Products table
Suppose that each product in the product sales database falls under a general category, such as beverages, condiments, or seafood. The Products table could include a field that shows the category of each product.
Suppose that after examining and refining the design of the database, you decide to store a description of the category along with its name. If you add a Category Description field to the Products table, you have to repeat each category description for each product that falls under the category — this is not a good solution.
A better solution is to make Categories a new subject for the database to track, with its own table and its own primary key. You can then add the primary key from the Categories table to the Products table as a foreign key.
The Categories and Products tables have a one-to-many relationship: a category can include more than one product, but a product can belong to only one category.
When you review your table structures, be on the lookout for repeating groups. For example, consider a table containing the following columns:
Here, each product is a repeating group of columns that differs from the others only by adding a number to the end of the column name. When you see columns numbered this way, you should revisit your design.
Such a design has several flaws. For starters, it forces you to place an upper limit on the number of products. As soon as you exceed that limit, you must add a new group of columns to the table structure, which is a major administrative task.
Another problem is that those suppliers that have fewer than the maximum number of products will waste some space, since the additional columns will be blank. The most serious flaw with such a design is that it makes many tasks difficult to perform, such as sorting or indexing the table by product ID or name.
Whenever you see repeating groups, review the design closely with an eye to splitting the table in two. In the above example, it is better to use two tables, one for suppliers and one for products, linked by supplier ID.
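Splitting a repeating group into two linked tables can be sketched in plain Python. The wide layout below (ProductID1, ProductName1, and so on) is an assumed stand-in for the numbered columns described above, and the rows are made up for illustration.

```python
# Hypothetical wide rows: each supplier carries a numbered, repeating product group.
wide_rows = [
    {"SupplierID": 1, "Name": "Supplier A",
     "ProductID1": 10, "ProductName1": "Chai",
     "ProductID2": 11, "ProductName2": "Chang",
     "ProductID3": None, "ProductName3": None},
    {"SupplierID": 2, "Name": "Supplier B",
     "ProductID1": 12, "ProductName1": "Ikura",
     "ProductID2": None, "ProductName2": None,
     "ProductID3": None, "ProductName3": None},
]


def split_repeating_groups(rows, max_groups=3):
    """Return (suppliers, products), with products linked back by SupplierID."""
    suppliers, products = [], []
    for row in rows:
        suppliers.append({"SupplierID": row["SupplierID"], "Name": row["Name"]})
        for i in range(1, max_groups + 1):
            product_id = row.get(f"ProductID{i}")
            if product_id is None:
                continue  # blank group: nothing to carry over
            products.append({
                "ProductID": product_id,
                "ProductName": row[f"ProductName{i}"],
                "SupplierID": row["SupplierID"],
            })
    return suppliers, products


suppliers, products = split_repeating_groups(wide_rows)
```

In the split design a supplier can have any number of products, no blank columns are stored, and the products can be sorted by ID or name directly.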
You can apply the data normalization rules (sometimes just called normalization rules) as the next step in your design. You use these rules to see if your tables are structured correctly. The process of applying the rules to your database design is called normalizing the database, or just normalization.
Normalization is most useful after you have represented all of the information items and have arrived at a preliminary design. The idea is to help you ensure that you have divided your information items into the appropriate tables. What normalization cannot do is ensure that you have all the correct data items to begin with.
You apply the rules in succession, at each step ensuring that your design arrives at one of what is known as the "normal forms." Five normal forms are widely accepted — the first normal form through the fifth normal form. This article expands on the first three, because they are all that is required for the majority of database designs.
First normal form
First normal form states that at every row and column intersection in the table, there exists a single value, and never a list of values. For example, you cannot have a field named Price in which you place more than one price. If you think of each intersection of rows and columns as a cell, each cell can hold only one value.
Second normal form
Second normal form requires that each non-key column be fully dependent on the entire primary key, not on just part of the key. This rule applies when you have a primary key that consists of more than one column. For example, suppose you have a table containing the following columns, where Order ID and Product ID form the primary key:
Order ID (primary key)
Product ID (primary key)
Product Name
This design violates second normal form, because Product Name is dependent on Product ID, but not on Order ID, so it is not dependent on the entire primary key. You must remove Product Name from the table. It belongs in a different table (Products).
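The partial dependency can be made visible with a small check in plain Python. The helper name `determines`, the Quantity column, and the sample rows are assumptions for this sketch; note that sample data can only refute a dependency, it cannot prove one holds in general.

```python
def determines(rows, lhs, rhs):
    """True if, in this sample, equal lhs values always come with equal rhs values."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = tuple(row[c] for c in rhs)
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical rows keyed on (OrderID, ProductID); Quantity is an added example column.
order_lines = [
    {"OrderID": 1, "ProductID": 10, "ProductName": "Chai", "Quantity": 2},
    {"OrderID": 1, "ProductID": 11, "ProductName": "Chang", "Quantity": 1},
    {"OrderID": 2, "ProductID": 10, "ProductName": "Chai", "Quantity": 5},
]

# ProductName follows from ProductID alone: a dependency on part of the key.
name_depends_on_part_of_key = determines(order_lines, ["ProductID"], ["ProductName"])
# Quantity needs the whole key; neither part is enough by itself.
quantity_depends_on_part_of_key = (
    determines(order_lines, ["ProductID"], ["Quantity"])
    or determines(order_lines, ["OrderID"], ["Quantity"])
)

# Moving ProductName into its own Products table removes the repetition.
products = {row["ProductID"]: row["ProductName"] for row in order_lines}
order_details = [
    {k: v for k, v in row.items() if k != "ProductName"} for row in order_lines
]
```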
Third normal form
Third normal form requires that not only every non-key column be dependent on the entire primary key, but that non-key columns be independent of each other.
Another way of saying this is that each non-key column must be dependent on the primary key and nothing but the primary key. For example, suppose you have a table containing the following columns:
ProductID (primary key)
SRP
Discount
Assume that Discount depends on the suggested retail price (SRP). This table violates third normal form because a non-key column, Discount, depends on another non-key column, SRP. Column independence means that you should be able to change any non-key column without affecting any other column. If you change a value in the SRP field, the Discount would change accordingly, thus violating that rule. In this case Discount should be moved to another table that is keyed on SRP.
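Moving Discount into a table keyed on SRP might look like the following sketch, using Python's sqlite3 module. To avoid floating-point keys, prices are stored in cents and discounts as whole percentages; those choices, the column names SRPCents and DiscountPercent, and the sample values are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Discount now lives in exactly one place per suggested retail price.
CREATE TABLE Discounts (
    SRPCents INTEGER PRIMARY KEY,
    DiscountPercent INTEGER NOT NULL
);
CREATE TABLE Products (
    ProductID INTEGER PRIMARY KEY,
    Name TEXT NOT NULL,
    SRPCents INTEGER NOT NULL REFERENCES Discounts(SRPCents)
);
INSERT INTO Discounts VALUES (1000, 5), (2000, 10);
INSERT INTO Products VALUES (1, 'Chai', 1000), (2, 'Chang', 2000);
""")


def discount_for(product_id):
    """Look the discount up through the product's SRP instead of storing it twice."""
    row = conn.execute(
        "SELECT d.DiscountPercent FROM Products AS p"
        " JOIN Discounts AS d ON d.SRPCents = p.SRPCents"
        " WHERE p.ProductID = ?",
        (product_id,),
    ).fetchone()
    return row[0]


discount_before = discount_for(1)
# Changing the SRP is a single-column update; no Discount value has to be edited.
conn.execute("UPDATE Products SET SRPCents = 2000 WHERE ProductID = 1")
discount_after = discount_for(1)
```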
DBMS Tutorial – Database Management System
A Database Management System (DBMS) is software used to manage the data in a database. Some popular database systems are MySQL, Oracle, and MongoDB. A DBMS provides operations such as creating a database, storing data in it, updating existing data, and deleting data. It enables you to store, modify, and retrieve data in an organized way, and it also provides security for the database.
In this Database Management System tutorial you'll learn basic to advanced topics such as the ER model, the Relational Model, Relational Algebra, Normalization, and File Organization.
- DBMS Introduction | Set 1
- DBMS Introduction | Set 2 (3-Tier Architecture)
- DBMS Architecture 2-level 3-level
- Need For DBMS
- Data Abstraction and Data Independence
- Database Objects
- Multimedia Database
- Use of DBMS in System Software
- Choice of DBMS | Economic factors
Entity Relationship Model :
- Enhanced ER Model
- Minimization of ER Diagram
- ER Model: Generalization, Specialization and Aggregation
- Recursive Relationships
Relational Model :
- Relational Model and CODD Rules
- Keys in Relational Model (Candidate, Super, Primary, Alternate and Foreign)
- Number of possible Superkeys
>> Quiz on ER and Relational Model
Relational Algebra :
- Basic Operators
- Extended Operators
- Inner Join vs Outer Join
- How to solve Relational Algebra Problems for GATE
Functional Dependencies :
- Finding Attribute Closure and Candidate Keys using Functional Dependencies
- Armstrong’s Axioms in Functional Dependency
- Canonical Cover
- Normal Forms
- Dependency Preserving Decomposition
- Lossless Join Decomposition
- LossLess Join and Dependency Preserving Decomposition
- How to find the Highest Normal Form of a Relation
- DBMS | Data Replication
>> Quiz on Normal Forms
Transactions and Concurrency Control :
- ACID Properties
- Concurrency Control -Introduction
- Concurrency Control Protocol | Graph Based Protocol
- Concurrency Control Protocol | Two Phase Locking (2-PL)-I
- Concurrency Control Protocol | Two Phase Locking (2-PL)-II
- Concurrency Control Protocol | Two Phase Locking (2-PL)-III
- Concurrency Control Protocol | Multiple Granularity Locking
- Concurrency Control Protocol | Thomas Write Rule
- Concurrency Control | Polygraph to check View Serializability
- DBMS | Log based recovery
- Timestamp Ordering Protocols
- Introduction to TimeStamp and Deadlock Prevention Schemes
- Conflict Serializability
- View Serializability
- How to test if two schedules are View Equal or not ?
- Recoverability of Schedules
- Precedence Graph for testing Conflict Serializability
- Transaction Isolation Levels in DBMS
- Database Recovery Techniques
>> Quiz on Transactions and concurrency control
Indexing, B and B+ trees :
- Indexing and its Types
- B-Tree | Set 1 (Introduction)
- B-Tree | Set 2 (Insert)
- B-Tree | Set 3 (Delete)
- B+ Tree (Introduction)
- Bitmap Indexing
>> Practice questions on B and B+ Trees
>> Quizzes on Indexing, B and B+ Trees
- File Organization – Set 1
- File Organization – Set 2 (Hashing in DBMS)
- File Organization – Set 3
- File Organization – Set 4
>> Quiz on File structures
Advanced Topics :
- Query Optimization
- How to store a password in database?
- Storage Area Networks
- Network attached storage
- Data Warehousing
- Data Warehouse Architecture
- Characteristics and Functions of Data warehouse
- Difficulties of Implementing Data Warehouses
- Data Mining
- Data Mining | KDD process
- Data Mining | Sources of Data that can be mined
- ODBMS – Definition and overview
- Architecture of HBase
- Apache HBase
- Architecture and Working of Hive
- Apache Hive
- Difference between Hive and HBase
- SQL | Tutorials
- Quiz on SQL
DBMS practice questions :
- Database Management Systems | Set 1
- Database Management Systems | Set 9
- Database Management Systems | Set 10
- Database Management Systems | Set 11
Advantages of DBMS
The following are some reasons to learn DBMS:
- Organization and management of data: A DBMS helps manage large amounts of data in an organized manner. It provides operations to create, read, edit, and delete data.
- Data Security: A DBMS protects the data from unauthorized users.
- Improved decision-making: From the data stored in the database we can generate graphs, reports, and other visualizations that help in decision-making.
- Consistency: In a traditional, manual approach to storing data, operations are done by hand and are prone to inconsistency, whereas a DBMS automates them through queries.
FAQs on Database Management System(DBMS)
Q.1 What is a database?
A database is an organized collection of data that can easily be created, updated, accessed, and managed. Records are maintained in tables or objects. A tuple (row) represents a single entry in a table. A DBMS manipulates the data in the database in response to queries given by the user.
Q.2 What are different languages present in DBMS?
- DDL (Data Definition Language): the commands used to define the database. E.g., CREATE, ALTER, RENAME, TRUNCATE, DROP.
- DML (Data Manipulation Language): the commands used to manipulate the data stored in a database. E.g., SELECT, UPDATE, INSERT, DELETE.
- DCL (Data Control Language): the commands that deal with user permissions and access controls of the database system. E.g., GRANT and REVOKE.
- TCL (Transaction Control Language): the commands used to manage database transactions. E.g., COMMIT, ROLLBACK, and SAVEPOINT.
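As a small illustration of these command groups, the sketch below runs DDL, DML, and TCL statements through Python's built-in sqlite3 module. The Students table and its rows are made up for the example. DCL is left out because SQLite has no GRANT or REVOKE; those commands apply to server-based systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the structure of the database.
conn.execute(
    "CREATE TABLE Students (StudentID INTEGER PRIMARY KEY, Name TEXT NOT NULL)"
)

# DML: add a row, then make the change permanent with TCL (COMMIT).
conn.execute("INSERT INTO Students (Name) VALUES (?)", ("Asha",))
conn.commit()

# DML followed by TCL (ROLLBACK): the update is undone.
conn.execute("UPDATE Students SET Name = ? WHERE StudentID = 1", ("Changed",))
conn.rollback()

# DML: read the data back.
names = [name for (name,) in conn.execute("SELECT Name FROM Students")]
```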
Q.3 What are the ACID properties in DBMS?
ACID stands for Atomicity, Consistency, Isolation, and Durability. These are the properties of a DBMS that ensure data can be shared safely and reliably among multiple users.
- A (Atomic): All changes made by a transaction are performed successfully, or none of them are.
- C (Consistent): Data must be in a consistent state before and after the transaction.
- I (Isolated): No other process can change the data a transaction is working on while that transaction is in progress.
- D (Durable): The changes made by a committed transaction must persist.
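Atomicity in particular is easy to demonstrate. The following is a minimal sketch using Python's sqlite3 module; the Accounts table, the balances, and the `transfer` helper are hypothetical, and the example does not exercise isolation or durability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Accounts ("
    " AccountID INTEGER PRIMARY KEY,"
    " Balance INTEGER NOT NULL CHECK (Balance >= 0))"
)
conn.executemany("INSERT INTO Accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(connection, source, target, amount):
    """Move money between accounts; both updates succeed or neither does."""
    try:
        with connection:  # commits on success, rolls back if an exception escapes
            connection.execute(
                "UPDATE Accounts SET Balance = Balance + ? WHERE AccountID = ?",
                (amount, target),
            )
            connection.execute(
                "UPDATE Accounts SET Balance = Balance - ? WHERE AccountID = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(connection):
    return dict(
        connection.execute(
            "SELECT AccountID, Balance FROM Accounts ORDER BY AccountID"
        )
    )


small_transfer_ok = transfer(conn, 1, 2, 30)   # enough funds
large_transfer_ok = transfer(conn, 1, 2, 500)  # would overdraw account 1
final_balances = balances(conn)
```

In the failed transfer the credit to account 2 has already run when the debit violates the CHECK constraint, yet the rollback removes it, so the data stays consistent.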
Q.4 What are the Advantages of DBMS?
The following are a few advantages of DBMS:
- Data Sharing: Data from the same database can be shared by multiple users at the same time.
- Integrity: Data is stored in an organized and refined manner.
- Data Independence: The data structure can be changed without changing the programs that use it.
- Data Security: A DBMS comes with tools that make the storage and transfer of data secure and reliable. Authentication and encryption are the tools used in DBMS for data security.
Domains can be thought of as distinct functional and/or structural units of a protein. These two classifications coincide rather often, as a matter of fact, and what is found as an independently folding unit of a polypeptide chain also carries a specific function. Domains are often identified as recurring (sequence or structure) units, which may exist in various contexts. The image below illustrates four "domains" identified as structural units in the MMDB entry 1IGR, chain A, as segments colored in magenta, blue, brown, and green. In molecular evolution such domains may have been utilized as building blocks, and may have been recombined in different arrangements to modulate protein function. We define conserved domains as recurring units in molecular evolution, the extents of which can be determined by sequence and structure analysis. Conserved domains contain conserved sequence patterns or motifs, which allow for their detection in polypeptide sequences. The distinction between domains and motifs is not sharp, however, especially in the case of short repetitive units. Functional motifs are also present outside the scope of structurally conserved domains. The CD database is not meant to systematically collect such motifs.
The two types of domains shown in the 1IGR illustration above -- 3D domains and conserved domains (or "domain families") -- often coincide with each other. However, because they represent two distinct types of data -- 3D structures and protein sequences, respectively -- they reside in two distinct databases: the Entrez Structure database (Molecular Modeling Database, MMDB) and the Conserved Domain Database (CDD). The former includes the spatial (X,Y,Z) coordinates of each atom in a structure (where 3D domains are identified algorithmically), while the latter shows the span and composition of a conserved protein sequence region.
Specifically, conserved domain models are based on multiple sequence alignments of related proteins spanning a variety of organisms to reveal sequence regions containing the same, or similar, patterns of amino acids. The illustration below provides an example, showing the multiple sequence alignment for the Furin-like domain, which is present in the Type 1 Insulin-like Growth Factor Receptor (1IGR) protein. The complete CDD record for that domain model is cd00064. A separate section of this help document provides additional information about multiple sequence alignment display options.
In the CDD database, protein sequences from three-dimensional structures are included in domain models whenever possible, as one goal of the NCBI conserved domain curation effort is to make multiple sequence alignments agree with what we can infer from three-dimensional structure and three-dimensional structure superposition, in order to understand sequence/structure/function relationships. The sequence-based domain models and corresponding 3D structures are also cross-referenced to each other through Entrez "Links" between CDD and structure records.
Conserved Domains can be described by local multiple sequence alignments spanning a variety of organisms to reveal sequence regions that contain the same, or similar, patterns of amino acids. Computational biologists from all over the world have compiled collections of such alignments representing conserved domains. CDD includes domains curated at NCBI as well as data imported from the external sources listed below, and data sources are indicated by their accession number prefixes. The source databases differ in their scope of coverage and the method by which they develop their models. Therefore, each source database may have its own model for a given conserved domain, in addition to some domain models found only in that database. To provide a non-redundant view of the data, CDD clusters similar domain models from various sources into superfamilies. The data sources include:
NCBI-curated domains use 3D-structure information explicitly to define domain boundaries and aligned blocks, and to amend alignment details. More details about the unique features of NCBI-curated domains are below. The goal of the curation project is to provide CDD users with insights into how patterns of residue conservation and divergence in a family relate to functional properties, and to provide useful links to more detailed information that may help to understand those sequence/structure/function relationships. The presence of conserved features helps to affirm family membership in search results with borderline significance, for example. NCBI CDD curators provide feature annotation and associated evidence in a computer-friendly way, so that the scientific community can build software tools for the automation of tasks such as annotation transfer.
NCBIfams is a collection of protein family hidden Markov models (HMMs) for improving bacterial genome annotation. A paper by Haft et al. (2018) provides additional information about NCBIfams, which is part of NCBI's Reference Sequence ( RefSeq ) project.
External Data Sources
In addition, CDD imports data from five other major sources, listed below. The version number (as available) of each source database that is imported into CDD is provided on the CDD News page.
- SMART (Simple Modular Architecture Research Tool): SMART is a web tool for the identification and annotation of protein domains, and provides a platform for the comparative study of complex domain architectures in genes and proteins. SMART is maintained by Chris Ponting, Peer Bork and colleagues, mainly at the EMBL Heidelberg. CDD contains a large fraction of the SMART collection.
- Pfam (Protein families): Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families. Pfam is maintained by Alex Bateman and colleagues, mainly at the Wellcome Trust Sanger Institute. CDD contains a large fraction of the Pfam collection.
- COGs (Clusters of Orthologous Groups of proteins): COGs is an NCBI-curated protein classification resource. Sequence alignments corresponding to COGs are created automatically from constituent sequences and have not been validated manually when imported into CDD.
- TIGRFAMs (The Institute for Genomic Research's database of protein families): TIGRFAMs, a research project of the J. Craig Venter Institute, is a collection of manually curated protein families from The Institute for Genomic Research and consists of hidden Markov models (HMMs), multiple sequence alignments, Gene Ontology (GO) terminology, cross-references to related models in TIGRFAMs and other databases, and pointers to literature.
- PRK (PRotein K(c)lusters): Protein Clusters is an NCBI collection of related protein sequences (clusters) consisting of Reference Sequence proteins encoded by complete prokaryotic and chloroplast plasmids and genomes. It includes both curated and non-curated (automatically generated) clusters.
CDD also contains data from additional research projects, such as KOGs (a eukaryotic counterpart to COGs) and the Library of Ancient Domains ( LOAD ), contributed by I. Aravind, E. Koonin, and colleagues. The KOGs data set is accessible as a separate CD-Search database / Batch CD-Search database , and the LOAD data set is available on the FTP site, but neither of those data sets is directly searchable by text term in Entrez CDD. The content of imported domain models is determined by the providers of the source database, with slight modifications made at NCBI to link a domain model's member sequences to corresponding, complete protein sequence and 3D structure records in Entrez databases, when possible. The method by which imported domain models are integrated into the CDD database is described in the CD assembly process section of this help document.
Accession Prefixes indicate data sources:
Source databases are evident from CD accessions. The accession prefix indicates the source database as follows:
- cd: Curated at NCBI
- sd: Domain models specifically built to annotate structural motifs; this is a subset of the NCBI-curated domain models.
- NF: NCBIfams
- pfam: Pfam
- smart: SMART
- COG: COGs
- KOG: KOGs (available as a separate search set via CD-Search (RPS-BLAST); not searchable by text term in Entrez)
- PRK: PRotein K(c)lusters (Entrez database)
- CHL: Chloroplast and organelle proteins; subset of the PRK database.
- MTH: Mitochondrial proteins; subset of the PRK database.
- PHA: Phage proteins; subset of the PRK database.
- PLN: Plant-specific (non-chloroplast) proteins; subset of the PRK database.
- PTZ: Protozoan proteins; subset of the PRK database.
- TIGR: TIGRFAMs
- LOAD_: Library of Ancient Domains (LOAD) data set (available as a separate data set via FTP; not searchable by text term in Entrez)
Accessions that start with "cl" are for superfamily cluster records, which can contain domain models from one or more source databases. When searching CDD, it is possible to limit search results to domains from any given source database by using the Database Search Field.
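The prefix list above lends itself to a simple lookup. The following is a minimal sketch, not an NCBI tool: the function name `source_of` and the table layout are illustrative assumptions, and prefix matching is assumed to be case-sensitive exactly as the prefixes are written above.

```python
# Prefix table transcribed from the accession-prefix list above.
# Matching is assumed to be case-sensitive, as the prefixes are written there.
PREFIX_TO_SOURCE = [
    ("LOAD_", "Library of Ancient Domains (LOAD)"),
    ("smart", "SMART"),
    ("pfam", "Pfam"),
    ("TIGR", "TIGRFAMs"),
    ("COG", "COGs"),
    ("KOG", "KOGs"),
    ("PRK", "PRK (Protein Clusters)"),
    ("CHL", "PRK subset: chloroplast and organelle proteins"),
    ("MTH", "PRK subset: mitochondrial proteins"),
    ("PHA", "PRK subset: phage proteins"),
    ("PLN", "PRK subset: plant-specific (non-chloroplast) proteins"),
    ("PTZ", "PRK subset: protozoan proteins"),
    ("cd", "NCBI-curated"),
    ("sd", "NCBI-curated structural motif model"),
    ("cl", "Superfamily cluster"),
    ("NF", "NCBIfams"),
]


def source_of(accession: str) -> str:
    """Return the source database implied by an accession's prefix."""
    for prefix, source in PREFIX_TO_SOURCE:
        if accession.startswith(prefix):
            return source
    return "unknown"


if __name__ == "__main__":
    for acc in ("cd00064", "pfam00082", "cl02915"):
        print(acc, "->", source_of(acc))
```

A list of pairs is used rather than a dictionary so that the order of the prefix tests is explicit.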
NCBI-curated domain models are assembled using the methods briefly described in the source databases section of this document. More details about the NCBI curation process are provided by Marchler-Bauer, et al. (2007). An example of a multiple sequence alignment on which a model is based is shown in an illustration of the Furin-like domain. Domain models from external data sources are assembled by various methods, ranging from automated processing to manual curation, depending on the individual source database. Upon import into CDD, the alignments are processed as follows:
- Protein sequence alignments (illustration) from each of the source databases are processed in an automated way to provide links from each aligned sequence to the corresponding, complete record in the Entrez Protein database. Occasionally, sequences that cannot be identified in Entrez's databases are omitted or replaced with closely related matches. Whenever possible, sequences in Pfam, SMART, and COGs alignments are replaced with closely related sequences (passing a stringent sequence similarity threshold) that have direct links to three-dimensional structures in the Molecular Modeling Database (MMDB).
- A representative sequence is chosen for each domain model, preferably with a structure-link, for technical reasons. The representative sequence is generally shown as the first member of the multiple sequence alignment for a domain model. By default, this representative is the 3D structure shown when CD alignments are visualized with Cn3D.
- A consensus sequence is computed from the imported alignments. Alignment columns have to be represented in at least 50% of all aligned sequences (weighted by diversity) to determine the extent of the consensus. The most frequently occurring residue in each column (after weighting to account for redundancy) is reported.
- A position-specific scoring matrix (PSSM) is calculated for the extent of the consensus sequence.
The PSSM profiles the various amino acids that were present in a given position of the multiple sequence alignment for a domain model and how frequently each one was observed. The consensus sequence does not contribute to the residue frequency statistics. Each PSSM receives a unique identifier ( PSSM ID ). A PSSM ID is the unique identifier for a domain model's position-specific scoring matrix (PSSM) . If a domain model's PSSM changes in any way as a result of updates to its multiple sequence alignment, it receives a new PSSM ID. This happens because a conserved domain model can evolve over time. For example, as new sequence data become available, the curators of a source database might add sequences to a multiple sequence alignment or update the sequences already present. As a result of such changes to the domain model, the PSSM and its ID can change. (Additional notes: Each superfamily record in the Conserved Domain Database also has a PSSM ID, which refers to the specific set of conserved domain PSSM IDs that comprise the superfamily, rather than to an actual position-specific scoring matrix for the overall superfamily. Obsolete PSSMs (e.g., 667) cannot be retrieved through the Entrez CDD search interface because they are no longer indexed. However, they can be retrieved from the archival copy of the database by using the "Direct Fetch via UID" option on the CDD Search Methods page.) Search databases compiled of these PSSMs are available through the CD-Search service (see help document ) and on the NCBI FTP site as collections of pre-computed RPS-BLAST databases that can be used for locally installed versions of that program.
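The consensus and residue-frequency ideas described in the CD assembly process can be illustrated with a small sketch. It is a simplification under stated assumptions, not the CDD pipeline: every sequence gets a weight of 1 (the diversity and redundancy weighting mentioned above is omitted), gaps are written as "-", and a column contributes to the consensus whenever at least 50% of the rows have a residue there. The function names are hypothetical.

```python
from collections import Counter

GAP = "-"  # assumed gap character in the aligned rows


def column_counts(alignment):
    """Return one Counter of residues (gaps excluded) per alignment column."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    return [
        Counter(row[i] for row in alignment if row[i] != GAP)
        for i in range(length)
    ]


def consensus(alignment, min_occupancy=0.5):
    """Most frequent residue of every column occupied in >= min_occupancy of rows.

    Simplification: all rows are weighted equally.
    """
    n_rows = len(alignment)
    residues = []
    for counts in column_counts(alignment):
        occupied = sum(counts.values())
        if occupied / n_rows >= min_occupancy:
            residues.append(counts.most_common(1)[0][0])
    return "".join(residues)


if __name__ == "__main__":
    rows = ["ACD-F",
            "ACE-F",
            "A-DGF",
            "ACD--"]
    print(consensus(rows))
    print(column_counts(rows)[2])  # observed residue counts in the third column
```

The per-column counts are the raw material that a position-specific scoring matrix summarizes; converting counts into log-odds scores is a further step that this sketch does not attempt.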
As noted in the section on CDD data sources, NCBI-curated domains use 3D-structure information explicitly to define domain boundaries and aligned blocks, and to amend alignment details. The goal of the NCBI conserved domain curation project is to provide database users with insights into how patterns of residue conservation and divergence in a family relate to functional properties, and to provide useful links to more detailed information that may help to understand those sequence/structure/function relationships. To do this, CDD curators include the following types of information in order to supplement and enrich the traditional multiple sequence alignments that form the foundation of domain models:
3-dimensional structures and conserved core motifs: NCBI Conserved Domain Curators have re-evaluated and modified multiple sequence alignments imported from outside sources, and made them agree with what we can infer from three-dimensional structure and three-dimensional structure superposition. Curated alignments contain aligned blocks spanning all rows (with no gaps allowed inside blocks) and unaligned regions between blocks. The blocks are meant to represent conserved structural core motifs of the corresponding domain family. The 3D structures can be viewed interactively with the Cn3D structure viewing program. More information about viewing structures is provided in the section of this document on CD summary pages, and the illustration at the right provides an example of a protein structure that has been annotated by NCBI curators to highlight the Cl- binding residues. (Click on the illustration to open the current, interactive record for the Voltage-Gated Chloride Channel domain model, cd00400, in the Conserved Domain Database (CDD). From there, you can open an interactive version of the 3D structure, with conserved feature annotations, in the free Cn3D structure viewing program.)
Conserved features/sites: In addition to working on the alignment model (illustration), NCBI curators also record, when possible, the location and nature of features conserved in the domain family. Typically these would describe catalytic residues, binding sites, or motifs commonly referred to in the literature. Features are added if they seem applicable to the family described in the CD's scope and if there is evidence linking the feature to a set of addresses on the alignment. Such evidence is recorded and available for inspection; it may be free-text comments, citations linked to PubMed, or "structure evidence" that exemplifies the existence of a site by highlighting an actual molecular complex, for example. Both features and evidence can be visualized on CD summary pages (in the conserved features/sites summary box, and as hash marks (#) in the multiple sequence alignment displays), and with the Cn3D structure viewing program. An example is shown in the illustration at the right. (Click on the illustration to open the current, interactive record for the Voltage-Gated Chloride Channel domain model, cd00400, in the Conserved Domain Database (CDD). Note that the live web page may look different from the illustration shown here, because the Conserved Domain Database continues to evolve with the addition of new data; however, the concepts shown in the illustration remain stable.)
In addition, the CD-Search tool can be used to identify conserved features in a query protein sequence, designated by small triangles (illustrated example) in the search results graphical summary, when such features can be mapped from the conserved domain annotations to the query sequence. Clicking on a folder tab in the Conserved Features/Sites summary box will show the details for that feature, and it will refresh the multiple sequence alignment display with an extra alignment row that shows the feature number and uses hash marks (#) to indicate the specific residues involved. Only one feature at a time is shown in the multiple sequence alignment display.
Phylogenetic organization: Based on evidence from sequence comparison, NCBI Conserved Domain Curators attempt to organize related domain models into phylogenetic family hierarchies (illustrated example). The CDTree program used by NCBI curators can be downloaded in order to view NCBI-curated domains interactively and in greater detail.
Links to electronic literature resources: NCBI-curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.
NCBI-curated domains can be recognized in CDD search results by their "cd" accession number prefix. It is also possible to limit CDD search results to domain models from any given source database by using the Database Search Field.
A domain family hierarchy is a set of related domains that share a common ancestor, a common set of conserved residues, and a common general function, but differ from each other in their specific phylogeny, specific functions, and additional spans of conserved residues. Domain hierarchies are present in NCBI-curated domains in order to provide insights into how patterns of residue conservation and divergence in a family relate to functional properties. Some domain families have only a single node, while others have a hierarchy that is two or more levels deep, sometimes with numerous nodes (" subfamilies ")at each level. Such hierarchies have generic "parent" models and more specific "children". The parent node contains a span of conserved residues that is also present in each of the children. Each of the child nodes can have additional conserved residues that extend beyond that span and help to further characterize the members of the child node. NCBI CDD Curators attempt to split "children" nodes where they see evidence for ancient gene duplications resulting in orthologous groups, often occurring together with functional divergence. The CDTree program used by NCBI curators can be downloaded in order to view NCBI-curated domains interactively and in greater detail, with or without a query sequence embedded . An illustrated example of a subfamily hierarchy is provided below. Click anywhere on the image to open the current, interactive record for the Voltage-Gated Chloride Channel domain model, cd00400 , in the Conserved Domain Database (CDD). Note that the live web page may look different from the illustration shown here, because the Conserved Domain Database continues to evolve with the addition of new data; however, the concepts shown in the illustration remain stable.
A superfamily cluster is a set of conserved domain models that generate overlapping annotation on the same protein sequences. These models are assumed to represent evolutionarily related domains and may be redundant with each other. A superfamily accession number begins with the prefix " cl " for "cluster". (Some superfamilies contain only a single conserved domain model (singleton), and these are not indexed in Entrez. Only superfamilies that contain two or more conserved domain models are indexed in Entrez and will therefore appear in search results.)
Clustering methodology: Superfamily members are clustered through an automated process that involves the following steps:
- Identify domain models that have overlapping hits on sequences in the Entrez Protein database from at least five different identical protein groups (IPGs). Technical note: In the data processing pipeline at NCBI, protein sequence records that contain an identical sequence, regardless of TaxID, are placed in an identical protein group (IPG), and each group is given a stable unique identification number (referred to as IPG ID or UID).
- Store the overlapping domain models as pairwise associations, and use those pairwise associations to populate a similarity matrix.
- Refine the similarity matrix by comparing it against a "blacklist" (to remove unacceptable pairwise associations) and against a "whitelist" (to add pairwise associations that are known to be valid but are not yet listed):
  - The "blacklist" is used to separate domain models that should never be paired. The blacklist overrides all other aspects of the clustering algorithm. For example, cd00538 (PA domain hierarchy) is blacklisted against pfam00082 (a serine protease model, subtilase family) and against cd08022 (an M28-family metalloprotease). These pairings are forbidden because the PA (protease-associated) domain is often inserted in a protease domain, yet the proteases that contain the insert are distinct from each other. For example, if a PA domain is in a serine protease and also in a metalloprotease, the two types of protease would be clustered together by the algorithm in the absence of a blacklist. However, the metalloprotease and serine protease are actually distinct, and simply represent convergent evolution to similar function.
  - The "whitelist" is used to add pairwise associations of domain models that are known to be related, but that might not have been listed in the initial, unrefined similarity matrix. For example, NCBI-curated domain models that are organized hierarchically are part of the whitelist. Conserved domain models from external databases can also be grouped together, if those domains are known to be related but were not grouped automatically by the clustering algorithm. An example of such a whitelisted pair is shown in cluster cl23875: MvaI_BcnI Superfamily, which includes pfam15515 (MvaI/BcnI restriction endonuclease family) and pfam09562 (LlaMI restriction endonuclease), as of 30 January 2018. Those two conserved domain models for restriction endonucleases (REs) have relatively few hits in common. RE superfamilies are very sequence-diverse, meaning that models for specific subfamilies can differ quite a bit in terms of overall length and conserved residue patterns/signatures. Nevertheless, the models are whitelisted together because they are known to be related.
- Take the refined similarity matrix and feed it to the Python "fastcluster" package (https://pypi.python.org/pypi/fastcluster) to create clusters using the "complete linkage" algorithm.
- Implement a post-processing step to compare the Python-generated clusters against the whitelist, in case Python did not put two domain models in the same cluster but should have.
NOTE: Multi-domain models that were computationally detected are not included in superfamily clusters. These models are likely to contain multiple single domains and might falsely join superfamily clusters.
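The refinement, clustering, and post-processing steps above can be sketched in a few lines of Python. This is an illustration under explicit assumptions, not NCBI's pipeline: it uses a naive pure-Python stand-in for the "fastcluster" package, treats similarity as binary (two models are either associated or not), and merges two clusters only when every cross-cluster pair is associated, which is complete linkage cut just below the maximum distance. The function names are hypothetical; the blacklist and whitelist pairs in the usage example are the ones named above, and the input associations are illustrative.

```python
from itertools import combinations


def refine(associations, blacklist, whitelist):
    """Add whitelisted pairs, then remove blacklisted ones (the blacklist wins)."""
    pairs = {frozenset(p) for p in associations}
    pairs |= {frozenset(p) for p in whitelist}
    return pairs - {frozenset(p) for p in blacklist}


def complete_linkage(models, pairs):
    """Naive stand-in for fastcluster: merge two clusters only when every
    cross-cluster pair of models is associated."""
    clusters = [{m} for m in models]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(clusters, 2):
            if all(frozenset((x, y)) in pairs for x in a for y in b):
                clusters.remove(a)
                clusters.remove(b)
                clusters.append(a | b)
                merged = True
                break  # restart the scan on the updated cluster list
    return clusters


def enforce_whitelist(clusters, whitelist, blacklist):
    """Post-processing: join clusters that split a whitelisted pair, unless
    the merge would bring a blacklisted pair together."""
    banned = {frozenset(p) for p in blacklist}
    for x, y in whitelist:
        cx = next(c for c in clusters if x in c)
        cy = next(c for c in clusters if y in c)
        if cx is cy:
            continue
        if any(frozenset((p, q)) in banned for p in cx for q in cy):
            continue
        clusters.remove(cx)
        clusters.remove(cy)
        clusters.append(cx | cy)
    return clusters


if __name__ == "__main__":
    blacklist = [("cd00538", "pfam00082"), ("cd00538", "cd08022")]
    whitelist = [("pfam15515", "pfam09562")]
    associations = list(blacklist)  # illustrative overlapping-hit associations
    models = ["cd00538", "pfam00082", "cd08022", "pfam15515", "pfam09562"]
    pairs = refine(associations, blacklist, whitelist)
    clusters = enforce_whitelist(complete_linkage(models, pairs), whitelist, blacklist)
    print(sorted(sorted(c) for c in clusters))
```

Because complete linkage requires every member of one cluster to be associated with every member of the other, a whitelisted pair can still end up in separate clusters, which is why a separate post-processing pass is useful.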
Rationale: Superfamilies provide a method for organizing data within CDD in a non-redundant way. CDD contains conserved domains from a number of different source databases, each of which may have its own model for a given conserved domain. The models might share many similarities in their reported residue conservation patterns, but differ in the specific protein sequences used in the multiple alignment, their footprint length (domain boundaries), and biological annotations. Because of the similarities, RPS-BLAST might find that multiple domain models align to the same general region of a query protein, but have different footprints and E-value scores relative to the query protein. If the footprints of two or more domain models overlap on the query, those models are clustered into the same superfamily, then the superfamily continues to be extended using the methodology described above.
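The footprint-overlap idea in the rationale can be written as an interval test. The sketch below assumes each hit is a hypothetical `(model, start, end)` tuple with 1-based, inclusive residue positions on the query, and it treats any shared residue as an overlap; the actual pipeline may apply a stricter criterion.

```python
from itertools import combinations


def footprints_overlap(a, b):
    """a and b are (start, end) residue ranges, 1-based and inclusive."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlapping_pairs(hits):
    """hits: list of (model, start, end) tuples for one query protein.

    Returns the pairs of models whose footprints share at least one residue.
    """
    return [
        (m1, m2)
        for (m1, s1, e1), (m2, s2, e2) in combinations(hits, 2)
        if footprints_overlap((s1, e1), (s2, e2))
    ]


if __name__ == "__main__":
    # Model names and coordinates are made up for illustration.
    hits = [("modelA", 10, 120), ("modelB", 15, 110), ("modelC", 200, 260)]
    print(overlapping_pairs(hits))
```

The resulting pairs correspond to the pairwise associations that feed the similarity matrix in the clustering methodology.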
Example : One example of a superfamily is Cluster ID cl02915 , which contains various domain models for the voltage-gated chloride channel. Superfamily members include the NCBI-curated domain cd00400 and all members of that family hierarchy plus domain models from external resources .
Selection of Superfamily Representative: A superfamily can contain one to many domain models. As of spring 2008, approximately 70% of the ~9,000 superfamilies contain a single model and the rest contain multiple models. Single-model superfamilies often represent proteins specific to certain organisms or taxonomic lineages (for example, viruses). The numbers of superfamilies containing single or multiple domain models will continue to evolve as new domains are imported and new NCBI-curated hierarchies are added. In superfamilies containing multiple domain models, one of the models is selected as the source of the superfamily name and description. The representative is one of the following, listed in priority order:
- the parent node of an NCBI-curated domain family hierarchy, if one is present in the superfamily cluster. In the few cases where a superfamily contains more than one NCBI-curated domain, the parent of the hierarchy with the largest number of sequence hits is chosen as the superfamily representative.
- the Pfam domain model that hits the largest number of Entrez protein sequences in an RPS-BLAST search
- the SMART, COG, PRK, or CHL model that hits the largest number of Entrez protein sequences in an RPS-BLAST search
- the sole member of a superfamily
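The priority order for choosing a representative can be expressed as a tiered selection. This is a sketch under assumptions: each model is a dictionary with hypothetical keys `accession`, `hits` (number of Entrez protein sequences hit), and `is_parent` (whether it is the parent node of an NCBI-curated hierarchy), and the source database is inferred from the accession prefix. The accessions and hit counts in the example are made up.

```python
def pick_representative(models):
    """Return the accession of the superfamily representative, or None.

    Tiers follow the priority order described above; within a tier the
    model with the most hits wins.
    """
    tiers = [
        lambda m: m["accession"].startswith("cd") and m["is_parent"],
        lambda m: m["accession"].startswith("pfam"),
        lambda m: m["accession"].startswith(("smart", "COG", "PRK", "CHL")),
    ]
    for in_tier in tiers:
        candidates = [m for m in models if in_tier(m)]
        if candidates:
            return max(candidates, key=lambda m: m["hits"])["accession"]
    # Last rule: the sole member of a single-model superfamily.
    # Multi-model cases not covered by the tiers are left undecided here.
    return models[0]["accession"] if len(models) == 1 else None


if __name__ == "__main__":
    members = [
        {"accession": "pfam99999", "hits": 500, "is_parent": False},
        {"accession": "cd99901", "hits": 100, "is_parent": True},
        {"accession": "cd99902", "hits": 50, "is_parent": False},
    ]
    print(pick_representative(members))
```

Note that the curated parent wins even though the Pfam model has more hits, because the tiers are evaluated strictly in priority order.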
Superfamilies can change over time: The composition of a cluster can change over time due to a variety of factors, such as:
- availability of new domain models as the Conserved Domain Database continues to grow
- changes to previously existing models
- new and/or updated sequence records in the Entrez Protein database
- refinements to the automated clustering procedures
A superfamily cluster accession number will remain the same if at least 50 percent of its member models (conserved domain accessions) have not changed relative to the previous version of the cluster. If more than 50 percent of the conserved domain accessions from a previous version of a cluster are no longer present in the new build of that cluster, or if the cluster size more than doubles with a new build, then the superfamily cluster accession is retired and replaced by a new accession(s). If two previous clusters merge into a single new cluster, the superfamily cluster accession number of the larger component cluster is used for the new grouping. A superfamily also has a PSSM ID, which refers to the specific set of PSSM IDs for the domain models that comprise the superfamily, rather than to an actual position-specific scoring matrix for the overall superfamily. The superfamily PSSM ID will change if there is any change to the set of member PSSM IDs relative to the previous version of the cluster (e.g., if a member conserved domain gets a new PSSM ID due to changes in its multiple sequence alignment, or if a new conserved domain model is added to the superfamily as the result of a CDD database update). The CD summary page for a superfamily record (for example, Cluster ID cl02915) does not include a multiple sequence alignment display; rather, it provides the name and description of the superfamily and lists the domain models that belong to it. The multiple sequence alignment for any member domain model can be viewed by clicking on it to open its CD summary page.
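The accession-stability rule can be checked with simple set arithmetic. The sketch below is one reading of the rule as stated: the accession is kept when at least 50 percent of the previous members are still present and the cluster has not more than doubled in size. The function name and the placeholder member names are assumptions for illustration.

```python
def accession_is_retained(old_members, new_members):
    """True if the superfamily accession survives a rebuild of the cluster."""
    old, new = set(old_members), set(new_members)
    if not old:
        raise ValueError("previous cluster version must not be empty")
    kept_fraction = len(old & new) / len(old)   # share of old members still present
    more_than_doubled = len(new) > 2 * len(old)
    return kept_fraction >= 0.5 and not more_than_doubled


if __name__ == "__main__":
    old = ["m1", "m2", "m3", "m4"]              # placeholder member accessions
    print(accession_is_retained(old, ["m1", "m2", "m5"]))  # half of old members kept
    print(accession_is_retained(old, ["m1", "m5", "m6"]))  # only a quarter kept
```

Merges of two previous clusters, where the accession of the larger component is reused, would need the membership of both predecessors and are outside this sketch.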
Superfamilies that contain a single domain model ("singletons"): The concept of superfamilies was applied to CDD in order to cluster related conserved domain models together and provide a non-redundant view of the available domain models. After the superfamily clustering algorithm is applied to the domain models in CDD, all resulting clusters are referred to as superfamilies, regardless of how many domain models they contain. The non-redundant view of CDD therefore includes superfamilies with a single domain model ("singletons") as well as superfamilies containing two or more domain models. In the user interface, however, superfamilies that contain only one model are not displayed in search results, or listed as links from the domain model, because they look very similar to the model itself. In contrast, superfamilies that contain two or more models ("multi-model superfamilies") are displayed in search results, and are also accessible as links from their member domain models. The number of multi-model superfamilies is provided in the Database Statistics box on the "Conserved Domains and Protein Classification News" page, and they can be retrieved by clicking on that statistic.
Protein Query Sequence ( CD-Search ) :
Most users will explore conserved domains starting from CD-Search results for a protein of interest. The query can be a protein sequence in FASTA format or the GI or Accession of a protein sequence that exists in the Entrez Protein database. The search results will show the conserved domains found in the protein. The colored bars that depict the domain footprints (shown in both the concise display and full display of CD-Search results) are active hotlinks that open the corresponding CD summary pages with your query sequence embedded in the multiple sequence alignment of proteins used to create the domain model. The second half of this help document provides details on how to use the CD-Search service , including input required and output shown .
Text Term Search in Entrez CDD :
Allowable search terms
Conserved domains can be searched by text term in the Entrez CDD database. The Entrez query interface allows searching for keywords, publication dates, taxonomic span, accession numbers, and more. The search field summary table in this document shows the variety of terms that can be used to query the database and provides sample searches. It is also possible to use quotes to force multiple terms to be searched as a phrase, and to use an asterisk (*) as a wild card to search for a word stem. For example, search the Entrez CDD database for strings like "Kinase" or "pfam023*" or "Tetratrico*" to see how it works. A number of techniques can be used to search the database, offering varying degrees of control over your query. The search methods summary table provides examples of basic and advanced searches. In basic searches, you can just enter one or more search terms without specifying search fields, Boolean operators, or other search criteria. These searches are quick and easy but can result in some extraneous hits. Advanced search methods, on the other hand, allow you to exercise greater control over your search, for example, by specifying which search field to use for each query term, limiting search results to a particular type of record or source database, or refining your search in other ways. A separate section of this help document describes the CDD search results. (The PubMed help document and Entrez help document provide additional, general information about using the Entrez search system.)
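The same text terms can also be sent programmatically. The sketch below only builds a query URL for NCBI's E-utilities ESearch endpoint and does not contact the server; it assumes that the Entrez database name for CDD is "cdd" and that the standard `db`, `term`, and `retmax` parameters apply. The helper name `cdd_search_url` is hypothetical, and this help document itself describes only the web interface.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def cdd_search_url(term, retmax=20):
    """Build an ESearch URL for a text-term query against Entrez CDD.

    Assumes the Entrez database name is "cdd". The wildcard (*) and quotes
    are percent-encoded by urlencode.
    """
    query = urlencode({"db": "cdd", "term": term, "retmax": retmax})
    return EUTILS_ESEARCH + "?" + query


if __name__ == "__main__":
    for term in ("Kinase", "pfam023*", "Tetratrico*"):
        print(cdd_search_url(term))
```

Keeping URL construction separate from the network call makes the query easy to inspect before it is sent with whatever HTTP client is preferred.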
Document Summary (DocSum) Page
After querying Entrez CDD by text term , the initial search results page (also referred to as the document summary, or "DocSum") provides a list of the conserved domain records that contain your search terms . The terms can appear in any field of the record , unless a search field was specified in the query. (Note: A separate part of this document describes the results of a search by protein query sequence using CD-Search .) Click on the accession number or thumbnail image of any record on the DocSum page to view its conserved domain (CD) summary page . If desired, you can narrow your search by restricting the query to a search field of interest or adding more terms with a Boolean AND. Alternatively, you can broaden your search by adding more terms (e.g., synonyms) to your query with a Boolean OR, or by following links to Superfamily Members .
The "Display settings" menu acts upon all of the conserved domain records (default) in your search results, or on the subset you have selected with checkboxes. You can select items from multiple pages of the search results, if desired. The menu options are described below.
Format:
- Summary: a summary of all of the conserved domain records (default) retrieved by your search, or of those you have selected with checkboxes, in HTML format. The information shown for each record may include the following, as available:
  - Short name, which concisely defines the conserved domain
  - Thumbnail image indicating if the conserved domain includes a protein sequence from a 3D structure. If a 3D structure is included, the thumbnail will be a still graphic of the actual domain structure. If no 3D structure is available for the protein family from which the domain model was created, the thumbnail icon will show a schematic of a multiple sequence alignment.
  - First 100 characters of the text summary, which provides a synopsis of biological function and salient features of the domain
  - Accession number
  - PSSMid
  - A subset of links to additional information about the domain, including a "View in Cn3D" link that opens an interactive view of the domain's 3D structure in NCBI's free Cn3D structure viewing program and links to related data in other Entrez databases. (Note: The "Find Related Data" menu in the right margin of the search results page provides a complete list of links. That menu retrieves related data for all records (default) retrieved by your search, or for the subset of records you have selected with checkboxes.)
- Summary (text): a summary of the records retrieved by your search, in plain text format. By default, all records from your search result are listed. If you are interested only in specific records, select their checkboxes, select the desired display settings, and press "Apply" to view only those records. The information shown for each record is the same as in the "Summary" format described above, but does not include the subset of links to additional information.
- UI List: a list of the unique identifiers (UIs) for all of the conserved domain records (default) retrieved by your search, or for those you have selected with checkboxes.
Items per page: By default, 20 documents are listed per page. If desired, decrease (to a minimum of 5) or increase (to a maximum of 200) the number of documents displayed per page, then press the "Apply" button.
Sort by: Search results are displayed in order of decreasing relevance with respect to the query. Many search fields have a score or rank associated with them; for example, the Title and Organism fields have a high rank, while the Description of Sites field has a lower rank. The presence of a search term in any one or more of the fields is scored accordingly by the search system, and the total score given to a hit is used in determining its relevance to the query and therefore its placement on the search results page. Additional options are available to sort records by descending or ascending order of Accession, Database, Modification Date, Number of Sites, PSSM Length, Publication Date, and Structure Representatives. A few of these sort options will cause certain types of records to cluster at the top or bottom of the search results, depending on whether ascending (up) or descending (down) order is chosen. For example, if you sort by:
- Number of Sites: NCBI-curated domain models will appear at the top of search results (if "sort by number of sites (down)" order is selected) because conserved features, such as catalytic or binding sites, are annotated only on those domain models.
- PSSM Length: the superfamily records will appear at the bottom (if "sort by PSSM length (down)" is selected) because they do not have an actual Position Specific Scoring Matrix (PSSM). Rather, each member of a superfamily has a PSSM and corresponding PSSM ID. A superfamily's PSSM ID refers to the specific set of conserved domain PSSM IDs that comprise the superfamily, rather than to an actual position-specific scoring matrix for the overall superfamily.
- Structure Representatives: NCBI-curated domain models will tend to appear at the top of search results (if "sort by structure representatives (down)" order is selected) because the curation process includes incorporation of protein sequences from resolved structures. Domain models from external data sources may also contain structure representatives. As noted in the section on data processing, sequences in Pfam, SMART, and COGs alignments are replaced, whenever possible, with closely related sequences that have direct links to three-dimensional structures in the Molecular Modeling Database (MMDB).
Technical note: If you retrieve all records in the database by searching the Filter field for All[Filt], the records are simply displayed in descending order of UID (i.e., PSSM ID).
A detailed view (CD Summary page) for a conserved domain can only be viewed for one record at a time by following the link for that record's title or thumbnail image. The right margin of the search results page includes a "Find Related Data" box that can be used to retrieve additional data, from within CDD and from other Entrez databases, that are related to the conserved domains retrieved by your search. For example, the Related CDs / Superfamily Member Links option will retrieve the other domain models in CDD that appear to be evolutionarily related to or redundant with the domains listed/selected on the page. The Protein, Structure, Gene, PubMed, etc. links traverse to associated data in those Entrez databases.
"Send To" menu options
Filter your results
The "Filter your results" area in the upper right corner of a search results page allows you to see all the records (default) retrieved by your search, or subsets of your search results that reflect commonly requested categories of records, and shows the corresponding number of records in each case. The links for "NCBI-curated," "imported," "families" (individual conserved domain models), and "superfamilies" (clusters of evolutionarily related conserved domain models into which the individual conserved domain models fall) show the number of retrieved records that fall into each of those categories, and allow you to view those subsets of your search results, if desired.
Find related data:
The "Related information" box that appears in the right margin of the display for an individual record allows you to retrieve related data for that particular domain model. (For example, the "Related CDs / Superfamily Members" link for accession cd00400 will retrieve the other domain models in the Conserved Domain Database that appear to be evolutionarily related to or redundant with cd00400.)
A "Find Related Data" box (instead of a "Related information" box) will appear in the right margin of a CDD search results page if you retrieved two or more records. The "Find Related Data" box allows you to retrieve related data for all the models retrieved by your search (default), or for the domain models you have selected with checkboxes. (For example, the "Find Related Data" option for Related CDs / Superfamily Member Links will retrieve the other domain models in CDD that appear to be evolutionarily related to or redundant with the domains retrieved by your search, or with the domains you have selected with checkboxes.)
The links in either display can include the following, depending on the related data that are available for the domains you have retrieved:
- Related CDs
- Literature
- Sequence
- Structure
- BioSystems
- Other Links
A "Links" box also appears in the displays of individual conserved domain records. All links are described in the help document section on "CDD Record (CD summary page): What information is displayed for each domain model on its CD Summary page?": "Links to related data in Entrez". The number and type of links that exist vary among CDD records, depending on the related data that are available for any given record. Most links are accessible on both the search results page and on a CD summary page, although a few of the links are available in only one of those places (*), such as the Representatives and Books links, which are available only on the CD summary page.
Domain Architecture (CDART): The Conserved Domain Architecture Retrieval Tool (CDART) program has been used to analyze the domain architecture of all sequence records in the Entrez Protein database, and to identify proteins with similar architecture. Those proteins are accessible by selecting "Domain Relatives" in the "Links" menu of a protein sequence record of interest (illustrated example). Alternatively, you can search CDART directly by entering a query protein sequence in FASTA format, or by entering the GI or Accession number of a protein sequence that already exists in the Entrez Protein database. CDART will then retrieve proteins that contain one or more of the domains present in the query sequence. More information about CDART is available in the overview, help document, and corresponding publication.
Entrez PubMed links to Conserved Domain Reviews (CDR) in Entrez Books
As you are viewing a search results page, click on the thumbnail image or title for any conserved domain model to see its summary page. The thumbnail will show a snapshot of the 3D structure if a domain model includes one or more protein sequences from a resolved 3D structure. The thumbnail will depict a sequence alignment if a domain model does not include any proteins from a resolved 3D structure. A CD-summary page provides the following information for a domain model (example: cd00400: voltage-gated chloride channel):
- text summary (synopsis of function)
- links to the source database, literature references, and related data in Entrez, as available
- bioassay targets and results, as available
- statistics summarizing salient features of the domain model, such as number of protein sequence rows in the alignment, PSSM identifier, and more
- structure viewing options to display the 3D structure(s), if available, of protein sequences used to curate the domain model
- conserved features (available for NCBI-Curated domains only)
- sequence cluster phylogenetic tree for protein sequences used to curate the domain (available for NCBI-Curated domains only)
- domain family hierarchy (available for NCBI-Curated domains only)
- multiple sequence alignments of the proteins used to develop the domain model
More details about each section of the CD-summary page are provided below.
Text Summary (synopsis of function): The text summary shown at the top of a CD summary page was written by curators at the source database and provides a synopsis of the domain's biological function. In NCBI curated domains, it also describes the taxonomic extent of the domain, whether it is a monomer or dimer, and any salient features. The text summary in a superfamily record is derived from the representative domain.
BioAssay Targets and Results: A section entitled "BioAssay Targets and Results" appears on a conserved domain's summary page only if one or more members of the protein family have been used as targets in PubChem BioAssay records, and if at least one chemical was identified in the experiment(s) to be active against one of the targets defined by the domain family. As examples:
- view the conserved domain summary page for:
  - cd09816: prostaglandin_endoperoxide_synthase (Animal prostaglandin endoperoxide synthase and related bacterial proteins)
  - cd05061: PTKc_InsR (Catalytic domain of the Protein Tyrosine Kinase, Insulin Receptor)
- retrieve a concise list of conserved domain models with links to bioassays
- retrieve all conserved domain models with links to bioassays
The "BioAssay Targets and Results" section lists bioassays that have tested the activities of small molecules (e.g., chemicals) against protein sequences that have a specific hit to this domain model, and are therefore considered to be members of this protein family. Some of the information from the bioassays may be generalizable to other members of the protein family, depending on how narrowly a family is defined. Up to three representative bioassays are listed as examples in the "BioAssay Targets and Results" box. Click on "Explore more" at the bottom of the box to see a complete list of experiments in the PubChem BioAssay database that have tested small molecules against protein targets belonging to this protein family. From there, you can open the "BioAssays" or "Compounds" folder tabs and click on the counts in columns such as "active," "inactive," and "tested" to see the chemicals that have been screened and their activity potency in the respective bioassays.
Item: Description
- PSSM-ID: the unique identifier for the position-specific scoring matrix (PSSM) generated by RPS-BLAST for a given multiple sequence alignment. If the sequence alignment changes in any way, for example, if new sequences are added to the alignment, a new PSSM will be generated and will receive a new PSSM-ID. (Note: Each superfamily record in the Conserved Domain Database also has a PSSM ID, which refers to the specific set of conserved domain PSSM IDs that comprise the superfamily, rather than to an actual position-specific scoring matrix for the overall superfamily.)
- View PSSM: opens a separate window with a graphical view of the domain model's PSSM, showing the relative frequencies of various residues at each position of the domain model. This viewer was prepared as part of an NCBI course on Exploring 3D Molecular Structures.
- Aligned: lists the number of rows in the sequence alignment. In general, each row comes from a different sequence record. However, sometimes two or more rows can be from the same GI number (i.e., same sequence record), if the sequence contains multiple instances of the domain.
- Threshold Bit Score: the domain-specific threshold score (shown as a bit score) that an RPS-BLAST hit must meet or exceed in order to be considered a specific hit, which represents a high confidence association between a protein query sequence and a conserved domain and therefore a high confidence level for the inferred function of the protein query sequence. The threshold is equal to the weakest E-value (and lowest bit score) among self-hits of a domain's member protein sequences to the resulting domain model (illustrated example). Domain-specific threshold scores are calculated only for NCBI-curated domains.
- Threshold Setting GI: the GI number of the member protein sequence (i.e., the protein sequence from the domain model's multiple sequence alignment) that set the threshold bit score. A threshold setting GI number is displayed only for NCBI-curated domains, as thresholds are calculated as part of the curation process.
- Status: information about the CD's curation status. Curated models have been realigned by NCBI with consideration of 3D structure. Alignments imported from outside sources have not been changed (except for the import process detailed above).
- Author: name of the author who contributed the conserved domain model to the NCBI-curated data set. This line currently appears in records that were contributed by external collaborators (for example, cd08773). Mouse over the name to see a popup with additional contact information, if available, for the author.
- Created: date on which the seed (or de-novo) alignment was imported into CDD
- Updated: date of the most recent changes to the alignment model and/or descriptive information
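The relationship between the threshold bit score, the threshold setting GI, and a specific hit can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the member identifiers and bit scores are invented, the function names are hypothetical, and the threshold is taken to be the lowest bit score among the members' self-hits (the weakest self-hit). It is not NCBI's curation code.

```python
from typing import Dict, Tuple


def threshold_from_self_hits(self_hit_scores: Dict[str, float]) -> Tuple[str, float]:
    """Return (member_id, bit_score) for the weakest self-hit.

    Assumption: the domain-specific threshold is the lowest bit score
    obtained when each member sequence is scored against the model.
    """
    if not self_hit_scores:
        raise ValueError("need at least one self-hit score")
    member = min(self_hit_scores, key=self_hit_scores.get)
    return member, self_hit_scores[member]


def is_specific_hit(hit_bit_score: float, threshold: float) -> bool:
    """A hit is 'specific' if it meets or exceeds the threshold bit score."""
    return hit_bit_score >= threshold


# Invented self-hit bit scores for three hypothetical member sequences.
scores = {"member_A": 312.4, "member_B": 187.9, "member_C": 240.1}
setter, threshold = threshold_from_self_hits(scores)
```

Here `setter` plays the role of the "Threshold Setting GI": it identifies the member sequence whose self-hit set the threshold.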
Item: Description
- "Structure View" Button: The "Structure View" button in a conserved domain record opens the 3D structure(s), if available, of protein sequences used to curate the domain model. In order for the button to work, the Cn3D program must be installed on your computer. It is a free helper application available for Windows, Macintosh, and Unix platforms. Installation takes only a couple of minutes and a tutorial describes the program's features and functions. In addition to displaying an interactive view of the 3D structure(s), Cn3D will also display the multiple sequence alignment of those and other proteins used in the curation of the domain model. The Cn3D structure view and sequence view windows communicate with each other, so highlighting residues in one window will also highlight those residues in the other window. As noted in the sections on the CD assembly process and unique features of curated domains, NCBI staff include protein sequences from resolved 3D structures (illustration) whenever possible in the multiple sequence alignment of a domain model. In a multi-level domain hierarchy, the 3D structures might be present in the parent node (e.g., cd00400) if they are not present in an intermediate or terminal node (e.g., cd03683). In that case, click on the parent node to view structures that have been specially annotated to highlight the conserved feature. You can click on any of the thumbnail structure images on a CD summary page to launch Cn3D. The thumbnail images in the conserved features summary box will launch a specially annotated view of the structure that highlights the particular feature of interest. However, 3D structures are not always available. If a domain model does not include any structure-based protein sequences, the "Structure View" button will still open Cn3D, but only the sequence viewer window will be populated with data. Controls in Cn3D will then allow you to manipulate the sequence alignment in various ways, if desired. For example, Cn3D offers column-specific coloring by sequence conservation when invoked with multiple alignment views. This is a convenient way to study sequence conservation within a CD alignment and to find out how well the aligned query fits the existing patterns of conservation and variability. The Cn3D tutorial provides more information on the controls available.
- Program: Although the Structure View button provides the option of using an older version of Cn3D (3.0), the default choice is recommended because it uses the most recent public version of the program (currently Cn3D 4.1).
- Drawing: Structures, when available, can be displayed in varying levels of detail. All Atoms will load a detailed model. This option transmits a large amount of structure data, and loading the structures may therefore take some time. The Virtual Bonds setting displays C-alpha atoms only, with virtual bonds connecting them, and therefore transmits and loads more quickly.
- Aligned Rows: By default, Cn3D will display a multiple sequence alignment of up to 10 proteins, starting with sequences whose 3D structures are shown, and then also including sequences from proteins that do not yet have a resolved structure. Use the "aligned rows" menu to increase that number up to 100 rows.
Conserved Features/Sites summary box (available for NCBI-Curated domains only): If conserved features/sites have been annotated on an NCBI-curated domain, they are noted in a summary box near the top of the page, with one folder tab for each feature (illustrated example). Click on the folder tab for a feature of interest to view its details, such as:
- feature number: The ordinal number of the feature within the conserved domain model (e.g., feature 1, feature 2, etc.)
- feature name: Generally reflects the function of the conserved feature/site (e.g., Cl- selectivity filter, Cl- binding residues [ion binding site], dimer interface [polypeptide binding site])
- conserved feature residue pattern: The set of amino acids that characterizes the conserved feature/site. The amino acids are not necessarily adjacent to each other in the domain model, but instead appear at the positions designated by hash marks (# symbols) in the multiple sequence alignment of the domain model, as described below. The pattern may include ambiguity codes (for example, [ST], which indicates that a position can be occupied by either Serine or Threonine). The conserved feature residue pattern is specified by curators, and displayed on a conserved domain summary page, only if it is clear that specific residue types are necessary for the particular molecular function (such as metal coordination, glycosylation attachment, or enzyme catalysis). Therefore, not all conserved features/sites will include a "conserved feature residue pattern" line. Note: If a sequence in the Entrez Protein database gets a significant hit to the conserved domain model AND contains the conserved feature residue pattern, the site annotation will be transferred to that protein sequence record.
- evidence: may be free-text comments, literature citations, or "structure evidence" that exemplifies the existence of a site by highlighting an actual molecular complex in an experimentally resolved 3D structure.
- 3D structure thumbnail image, if available. The conserved amino acids that characterize the feature are highlighted in pink, both in the thumbnail image and in the larger, interactive view of the structure that appears when you click on the thumbnail to launch the structure in the free Cn3D viewing program. (Cn3D must first be installed on your machine in order for that to work.)
Clicking the folder tab for a feature of interest will refresh the multiple sequence alignment display with an extra alignment row that shows the feature number (feature 1, feature 2, etc.) and uses hash marks (#) to indicate the specific residues involved (also shown in the illustrated example). Only one feature at a time is shown in the multiple sequence alignment display.
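As a rough illustration of how a conserved feature residue pattern relates to the hash-marked alignment columns, the Python sketch below reads the residues of a sequence row at the "#" positions and checks them against a pattern with bracketed ambiguity codes. The pattern syntax, the function names, and the example rows are assumptions made for this sketch; the actual annotation-transfer logic used by CDD is not described here.

```python
import re
from typing import List


def parse_pattern(pattern: str) -> List[str]:
    """Split a pattern such as 'G[ST]H' into ['G', 'ST', 'H'].

    Each entry lists the residues allowed at one feature position.
    """
    tokens = re.findall(r"\[([A-Z]+)\]|([A-Z])", pattern)
    return [bracket or single for bracket, single in tokens]


def residues_at_hashes(feature_row: str, sequence_row: str) -> str:
    """Collect the residues of sequence_row in columns marked '#'."""
    return "".join(res for mark, res in zip(feature_row, sequence_row) if mark == "#")


def matches_feature(feature_row: str, sequence_row: str, pattern: str) -> bool:
    """True if every hash-marked residue is allowed by the pattern."""
    allowed = parse_pattern(pattern)
    observed = residues_at_hashes(feature_row, sequence_row).upper()
    return len(observed) == len(allowed) and all(
        res in options for res, options in zip(observed, allowed)
    )


# Invented example: three feature positions, the middle one ambiguous.
feature_row = "  #   # #  "
pattern = "G[ST]H"
row_ser = "MKGLAVSPHQW"  # G, S, H at the marked columns
row_thr = "MKGLAVTPHQW"  # G, T, H at the marked columns
row_ala = "MKGLAVAPHQW"  # G, A, H at the marked columns
```

With these inputs, `matches_feature` accepts the rows carrying Serine or Threonine at the ambiguous position and rejects the row carrying Alanine. A gap character at a marked column is also rejected, since it is not among the allowed residues.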
Sequence Cluster Phylogenetic Tree (available for NCBI-Curated domains only): Based on evidence from sequence comparison, NCBI Conserved Domain Curators attempt to organize related domain models into phylogenetic family hierarchies (details and illustration). Colors used in the sequence cluster phylogenetic tree correspond to colors used in the domain family hierarchy display. The Detailed View button on the CD summary page launches the sequence cluster view in a separate browser window, with more options for coloring and shading. To examine the hierarchical classification more closely, you can download the CDTree program, which is used by the NCBI curators. CDTree enables you to interactively view the complete domain hierarchy, including a detailed display of the sequence cluster tree. To view a query protein embedded into the sequence tree of a domain model, first use the CD Search tool to identify the conserved domains in the query sequence. Then click on the cartoon (colored bar) representing a domain of interest in either the Concise Display or Full Display of the CD-Search results page. That will open a CD Summary Page, which shows detailed information about the domain and provides an Interactive Display option for viewing the Hierarchy (an illustrated example is provided in the "How To" pages). To embed your query in the hierarchy, simply check the box for Add Query Sequence before pressing the "Interactive Display" button. (The free CDTree program must be installed on your computer in order for that button to work.) When the CDTree program opens, your query sequence will be highlighted in red. If the sequence tree is large, you might need to de-select the View/Fit to Screen option in CDTree's Sequence Tree window in order to see the tree, and the placement of your query sequence, in detail. The CDTree help document is packaged with the software and provides details on how to use the program.
Algorithms used to generate the cluster diagram in CDTree: The sequence tree viewer in CDTree calculates and displays sequence trees for a set of selected alignment models, which may or may not be linked in a hierarchical fashion. Sequence trees are the graphical depiction of results from simple phylogenetic analysis of the alignment data. Methods available for distance calculation are percent identity, Kimura-corrected percent identity, score of aligned residues, score of optimally extended blocks, blast score for the aligned footprints and blast scores for full-length sequences; a variety of commonly used scoring matrices can be selected. For the sequence trees displayed on CDD web pages, we commonly use "score of aligned residues", where pair-wise alignment scores derived from our multiple sequence alignments, and scored via BLOSUM62, are converted into distances. Trees can be constructed via single-linkage clustering, neighbor joining, or the Fast ME method. We use neighbor-joining for all of the sequence trees displayed on web-pages.
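To make the first two distance options concrete, here is a minimal Python sketch of a percent-identity distance between two alignment rows, followed by a Kimura-style correction. It rests on several assumptions: aligned residues are upper case and gaps are dashes (as in the alignment displays), only columns aligned in both rows are compared, and the correction uses the widely cited Kimura protein formula d = -ln(1 - p - 0.2 p^2). Whether CDTree computes these quantities in exactly this way is not stated in this document.

```python
import math


def percent_identity_distance(row_a: str, row_b: str) -> float:
    """Return 1 - (fraction of identical residues).

    Only columns where both rows hold an aligned (upper-case) residue
    are compared; gaps ('-') and unaligned (lower-case) residues are skipped.
    """
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a.isupper() and b.isupper()]
    if not pairs:
        raise ValueError("rows share no aligned columns")
    identical = sum(a == b for a, b in pairs)
    return 1.0 - identical / len(pairs)


def kimura_distance(p: float) -> float:
    """Kimura-style correction of an observed distance p (assumed formula).

    Returns infinity when the correction is undefined (very divergent rows).
    """
    arg = 1.0 - p - 0.2 * p * p
    return math.inf if arg <= 0 else -math.log(arg)


# Invented rows: two substitutions in ten aligned columns.
p = percent_identity_distance("ACDEFGHIKL", "ACDQFGHLKL")
d = kimura_distance(p)
```

A matrix of such pairwise distances is the kind of input that a clustering method such as neighbor joining takes to build a tree.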
Domain Family Hierarchy (available for NCBI-Curated domains only): As noted in the description of NCBI curated domains, the goal of the curation project is to provide CDD users with insights into how patterns of residue conservation and divergence in a family relate to functional properties. The CD summary page for an NCBI-curated domain shows the hierarchy (details and illustration) to which the currently viewed domain belongs. Some hierarchies have only one node, while others have many nodes organized into two or more levels. If a hierarchy has multiple nodes, you can click on another node of interest to view the CD summary page for that domain. Alternatively, you can download the CDTree program used by the NCBI curators in order to view the complete domain hierarchy interactively and in greater detail, with or without a query sequence embedded.
Multiple Sequence Alignment Displays : Member proteins used to create domain model: By default, the sequence alignment display at the bottom of a CD summary page shows 10 of the most diverse members reference sequence , as determined by BLAST) --> from the cluster of sequences used to create a domain model. (A sample multiple sequence alignment is shown in the illustration of cd00064 : Furin-like domain in this help document, or you can open a domain model directly in CDD, such as cd00400 : voltage-gated chloride channel.) The multiple sequence alignment display options (below) can be used to change the quantity and appearance of data displayed, and the CD-Search tool can be used if you'd like to embed a query sequence within the alignment. NCBI-curated , you can view the complete set of sequences with the CDTree program by pressing the "Interactive Display" button on the page. You must first load the CDTree program (free) onto your computer in order for the button to work. --> Protein query sequence embedded in alignment: To view a query protein embedded into the multiple sequence alignment of a domain model, first use the CD Search tool to identify the conserved domains in the query sequence. Then click on the cartoon (colored bar) representing a domain of interest in either the Concise Display or Full Display of the CD-Search results page. Display Options: By default , the multiple sequence alignment on a CD summary page is shown in hypertext format and displays up to 10 sequences that were used to curate the domain. The display format, number and type of sequence rows, and color scheme can be changed in the following ways: Format: text... Row display: text... Color Bits: text... Type Selection: text... Feature hash marks (#): text... Display Option Description Format Hypertext Interactive view in which each accession or GI number links to the corresponding complete sequence record in the Entrez Protein database. 
Displays all residues in each sequence row, with aligned residues shown in upper case , unaligned residues in lower case , and variation in sequence length shown as dashes . A horizontal scale indicates the number of residues in the overall alignment. The numbers at the beginning and end of each sequence row indicate the span of sequence data that was imported from the complete protein sequence record. Plain Text This view contains the same content as "Hypertext" but is rendered in ASCII format. Compact Hypertext Interactive view in which each accession or GI number links to the corresponding complete sequence record in the Entrez Protein database. Shows only aligned residues (as upper case letters), plus the number of intervening unaligned residues in each sequence row (shown in square brackets ). Does not show the unaligned residues themselves; those are shown only in the "Hypertext" and "Plain Text" format. Compact Text This view contains the same content as "Compact Hypertext" but is rendered in ASCII format. mFASTA Multiple FASTA (mFASTA) format is useful for importing the data into sequence analysis programs. For each sequence row in the alignment, it provides a FASTA -formatted definition line ("FASTA defline") followed by up to 80 characters of sequence data on each subsequent line. mFASTA format displays all residues in each sequence row, with aligned residues shown in upper case , unaligned residues in lower case , and variations in length filled in with dashes . Row Display Number of rows in a domain model The total number of sequence data rows aligned in a domain model are shown in the statistics portion of that model's CD summary page . Default number shown By default , 10 rows of sequence data are shown, including the representative sequence plus nine others. Maximum number shown You can change the number of sequence rows displayed using the Row Display pop-up menu. 
If the Type Selection is set to Most Diverse Members , a maximum of 100 rows can be displayed. If a domain model contains more than 100 rows , the Type Selection Top Listed Sequences allows the display of more than 100 rows. If a model is NCBI-curated , you can also use the CDTree program to view the complete set of rows. Simply install the program, which is free, then press the Interactive Display button in the hierarchy section of the domain model's CD summary page to view all the sequence rows. Note: In general, each row comes from a different sequence record. However, sometimes two or more rows can be from the same GI number (i.e., same sequence record), if the sequence contains multiple instances of the domain. Type Selection Most Diverse Members Lists the representative sequence followed by the most dissimilar protein sequences, as determined from the domain model multiple sequence alignment. They are listed from most to least dissimilar with respect to the representative sequence. Top Listed Sequences Merely refers to the order in which the sequences are listed in the multiple alignment; this may or may not be meaningful, depending on the approach used by the source database in curating a particular domain model. In NCBI-curated domain models , protein sequences from resolved 3-D structures are generally listed first, so the "Top Listed Sequences" display option is useful for bringing these structure-based protein sequences to the top when viewing NCBI-curated domains. The remaining sequences in NCBI-curated domain models are listed in order of increasing GI number or some other non-biological criterion. (This is because the composition of the member sequences, not their order, is important in determining a domain model's position-specific scoring matrix, or PSSM . The other important factor is the degree of residue conservation in any given column of the alignment, which can be visualized with the Color Bits setting, described below.) 
The biological relationships among the member sequences of an NCBI-curated domain model are displayed in the sequence cluster phylogenetic tree and the domain family hierarchy on the domain model's CD summary page. Both of these displays can also be viewed interactively using the CDTree program. Color Bits General Color Bits allow you to adjust the red blue balance of color used to depict the degree of conservation among aligned (upper case) residues . In general, red indicates highly conserved and blue indicates less conserved . (In other words, the two extremes on the color scale correspond to columns that are completely conserved (e.g., same residue in all alignment rows), and columns with residue types distributed in a way that is no different from the background distribution -- what would seem like a random pick of residue positions from arbitrary protein sequences.) Unaligned (lower case) residues are shown in grey . The color bit settings can be used to select a threshold for determining which columns are colored in red . Numerical settings Higher numbers require higher degrees of conservation within an alignment column (i.e., less residue variation) in order to display that column in red font. The score threshold that must be met in order for an alignment column to be displayed in red can be adjusted from a low of 0.5 to a high of 4.0. As the threshold increases, the number of columns shown in red will decrease. Background: Each column in the multiple sequence alignment display receives a score that indicates that column's "information content" -- its contribution to the overall alignment score -- indicating how important the column is as an "anchor" for the alignment. The higher the score, the more important that column is in the alignment. 
We use a fairly standard definition of "information content" for an aligned column: the sum, over all residue types i, of f(i) * log2(f(i)/q(i)), where f(i) is the observed relative residue frequency, q(i) is the background/reference relative frequency for that residue type (based on the table that accompanies the BLOSUM62 matrix), and log2 is the base 2 logarithm. This is also called "relative entropy", which is a popular way to measure the distributions of nucleotide bases or amino acids. A column's score is calculated on the fly, based on the sequence rows currently shown in the display. As the number and type of sequence rows in the display change, the column's score, and therefore its color, can change.
Identity setting: The Identity setting uses red font only in columns that contain the same residue in all of the sequence rows displayed. All other aligned columns are colored in blue and unaligned columns are shown in grey.
"Feature" hash marks (#): Hash marks (#) in the top row of a multiple sequence alignment display indicate the specific residues involved in a conserved feature, such as a binding or catalytic site, that has been annotated on an NCBI-curated domain. Although multiple features may have been annotated, only one feature at a time is shown in the multiple sequence alignment display. A conserved features/sites summary box (illustration) lists the features that have been annotated. Clicking on the tab for a feature of interest will show its details. It will also refresh the multiple sequence alignment display to mark the residues involved in the currently viewed feature (as depicted in the bottom of the illustration).
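The information content definition given earlier in this section can be sketched in a few lines of Python. This is only an illustration, not the code used by the alignment viewer. The function names are hypothetical, and a uniform background (1/20 per residue type) is assumed in place of the reference frequencies that accompany BLOSUM62, which are not reproduced here.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# ASSUMPTION: a uniform background (1/20 per residue type) stands in for the
# reference frequencies that accompany BLOSUM62, which are not reproduced here.
UNIFORM_BACKGROUND = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def column_information_content(column, background=UNIFORM_BACKGROUND):
    """Relative entropy (in bits) of one aligned column given as a string."""
    residues = [r for r in column if r in background]  # skip gaps and unknowns
    if not residues:
        return 0.0
    total = len(residues)
    score = 0.0
    for residue, count in Counter(residues).items():
        f = count / total  # observed relative frequency f(i)
        score += f * math.log2(f / background[residue])
    return score


def shown_in_red(column, threshold):
    """True if the column's score meets the chosen color-bit threshold."""
    return column_information_content(column) >= threshold
```

Under the uniform-background assumption, a fully conserved column such as "AAAA" scores log2(20), roughly 4.32 bits, while a column containing each of the 20 residue types once scores 0. Because the score depends only on the rows passed in, recomputing it after rows are added or removed mirrors the "on the fly" behavior described above.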
CDD is updated several times a year. We no longer try to follow updates of the source databases on a regular basis, but will re-import source database content occasionally. CDD continues to grow, however, through NCBI's curation effort. At the moment, CDD curators focus on capturing and describing hierarchies of related domain families, which are, for the most part, covered by the imported un-curated models as well. The current curation effort is restricted to ancient domain families with wide phylogenetic distribution, and focuses on families with at least one 3D structure representative.
The scientific community's understanding of molecular data continues to evolve as research progresses. Some domain models in CDD are generated through automated processes and others are curated . All are fluid and revised as new data become available and as new protein family clustering methods are developed. Because of this, we welcome your feedback on the data at [email protected] , including information/annotations you find particularly helpful as well as any discrepancies you may notice.
The CD-Search service is a web-based tool for the detection of conserved domains in protein sequences. It can therefore help to elucidate the protein's function. The CD-Search service uses RPS-BLAST to compare a query protein sequence against conserved domain models that have been collected from a number of source databases, and presents results as a concise display (default), standard display, or full display. If CD-Search finds a specific hit, there is a high confidence in the association between the protein query sequence and a conserved domain, resulting in a high confidence level for the inferred function of the protein query sequence. The other types of hits that can be found also shed light on the putative function of the query protein. The CD-Search tool can also identify putative conserved features in a query protein sequence, when such features can be mapped from the conserved domain annotations to the query sequence. If conserved features are found, they are designated by small triangles in the search results graphical summary, indicating the specific amino acids likely involved in functions such as catalysis or binding.
The CD-Search service uses RPS-BLAST, which stands for "Reverse Position-Specific BLAST". This is a variant of the popular PSI-BLAST program ("Position-Specific Iterated BLAST"). PSI-BLAST finds sequences significantly similar to the query in a database search and uses the resulting alignments to build a Position-Specific Score Matrix ( PSSM ) for the query. With this PSSM the database is scanned again to eventually pull in more significant hits, and further refine the scoring model. RPS-BLAST uses the query sequence to search a database of pre-calculated PSSMs, and report significant hits in a single pass. The role of the PSSM has changed from "query" to "subject", hence the term "reverse" in RPS-BLAST. RPS-BLAST is the search tool used in the CD-Search service. The CD-Search service provides a web-interface to the RPS-BLAST program, the CD search databases, and interactive alignment visualization including 3D structures. The search results can include several types of RPS-BLAST hits that represent various confidence levels ( specific hits , non-specific hits ) and domain model scope ( superfamilies , multi-domains ). A standalone version of the RPS-BLAST program is available as part of the NCBI toolkit distribution. A separate section of this document describes the differences between the CD-Search web tool and the standalone RPS BLAST program.
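To make the "reverse" direction concrete, the toy sketch below scores one query sequence against a small collection of position-specific score matrices in a single pass. It is not the RPS-BLAST algorithm: it uses ungapped scoring only, computes no statistics, and every name and score value in it is hypothetical. It is meant only to show the query acting as the probe and the PSSMs acting as the database.

```python
def best_ungapped_score(query, pssm):
    """Slide a PSSM along the query and return (best score, start offset).

    pssm is a list with one dict per model position, mapping residue -> score.
    ASSUMPTION: residues missing from a position's dict score 0.
    """
    width = len(pssm)
    best = None
    for start in range(len(query) - width + 1):
        score = sum(pssm[k].get(query[start + k], 0) for k in range(width))
        if best is None or score > best[0]:
            best = (score, start)
    return best  # None if the query is shorter than the model


def search_pssm_database(query, pssm_db):
    """Score the query against every PSSM once; best-scoring models first."""
    hits = []
    for name, pssm in pssm_db.items():
        result = best_ungapped_score(query, pssm)
        if result is not None:
            hits.append((name, result[0], result[1]))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

In PSI-BLAST the roles are swapped: a single PSSM built from the query is used to scan a database of sequences, and the scan is repeated as the matrix is refined.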
Options: The options below are only available when using the actual CD-Search form. Searches launched from the CDD home page or together with protein BLAST requests use default search parameters. (The CDD home page does allow you to select the database, however.)
Database Selection: currently, CD-Search is offered with the following search databases. Note that if you use the default "CDD" database, CD-Search automatically returns precalculated search results, unless you select the option to "force live search." If you select a database other than the default "CDD," the CD-Search program automatically uses the live search mode.
CDD - this is a superset including NCBI-curated domains and data imported from Pfam, SMART, COG, PRK, and TIGRFAMs. It is the default database for searches.
NCBI_Curated - NCBI-curated domains, which use 3D-structure information to explicitly define domain boundaries and aligned blocks and to amend alignment details, and which aim to provide insights into how patterns of residue conservation and divergence in a family relate to functional properties.
Pfam - a mirror of a recent Pfam-A database of curated seed alignments. Pfam version numbers do change with incremental updates. As with SMART, families describing very short motifs or peptides may be missing from the mirror. An HMM-based search engine is offered on the Pfam site.
SMART - a mirror of a recent SMART set of domain alignments. Note that some SMART families may be missing from the mirror due to update delays or because they describe very short conserved peptides and/or motifs, which would be difficult to detect using the CD-Search service. You may want to try the HMM-based search service offered on the SMART site. Note also that some SMART domains are not mirrored in CD because they represent "superfamilies" encompassing several individual, but related, domains; the corresponding seed alignments may not be available from the source database in these cases. Note also that SMART version numbers do not change with incremental updates of the source database (and the mirrored CD-Search database).
PRK - "PRK," short for Protein Clusters, is an NCBI collection of related protein sequences (clusters) consisting of Reference Sequence proteins encoded by complete prokaryotic and chloroplast plasmids and genomes. It includes both curated and non-curated (automatically generated) clusters.
TIGRFAMs - a mirror of a recent TIGRFAMs set of domain alignments.
COG - a mirror of the current COG database of orthologous protein families focusing on prokaryotes. Seed alignments have been generated by an automated process. An alternative search engine, "Cognitor", which runs protein-BLAST against a database of COG-assigned sequences, is offered on the COG site.
KOG - a eukaryotic counterpart to the COG database. KOGs are not included in the CDD superset, but are searchable as a separate data set.
More information about each database is provided in the section on " Where does CDD content come from? " and the version number (as available) of each source database is provided in the CDD News page.
Maximum number of hits: limits the size of the hit list produced by CD-Search. Typically, for average sized proteins, the number of expected domain-hits is small and the default setting of 500 should be more than sufficient.
Expect Value (E-value): a parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size. The E-value setting can be modified to adjust the statistical significance threshold used for reporting matches against PSSMs in the database. False positive results should be very rare with the default setting of 0.01 (use a more conservative, i.e. lower, setting for more reliable results). Results with E-values in the range of 1 and above should be considered putative false positives. Additional information about E-value is available in the Glossary of the NCBI Handbook and in the BLAST help document BLAST glossary. Note that the E-values you get (for any given protein query--conserved domain hit pair) on the CD-Search web service might differ from those you get when using standalone RPS-BLAST on your local PC. A separate section of this document describes the differences between the web service and standalone program and provides a tip on how you can generate the same results in standalone RPS-BLAST as those produced by the web service.
Composition-corrected scoring , which is employed by RPS-BLAST version 2.2.28 ( March 19, 2013 ) and up, abolishes the need to mask out compositionally biased regions in query sequences. This option is on by default . Note: In general, when composition-corrected scoring is on, the low complexity filter should be turned off. However, it is possible to have both options on at the same time (to filter false-positives that slip through the cracks of the composition-correction), or off at the same time (to find more distant relatives for compositionally biased queries), if desired.
Low Complexity Filter: filters query sequences for compositionally biased regions. These regions are flagged as such and largely ignored during the search phase if filtering is turned ON (the default setting is OFF). Note: In general, when the low complexity filter is turned on, the composition-corrected scoring should be turned off. However, it is possible to have both options on at the same time (to filter false-positives that slip through the cracks of the composition-correction), or off at the same time (to find more distant relatives for compositionally biased queries), if desired. If the low-complexity filter is turned on and compositionally biased regions are detected, they are shown in the CD-Search output as cyan regions in the bar graphic that represents the query sequence, as illustrated below. More information about the low complexity filter is also available in the BLAST help document.
If the low-complexity filter was ON for the search, the compositionally biased regions were NOT USED in the search against the domain database and are shown as SOLID cyan blocks. (As an example, open the default CD-Search results for P14780, GI 269849668, with filtering turned ON.) However, those regions may still overlap with or be included in a domain footprint and the pair-wise alignment generated by RPS-BLAST.
If the low-complexity filter was turned OFF for the search, the compositionally biased regions were USED in the search and are shown as blocks OUTLINED in cyan. (As an example, open the CD-Search results for P14780, GI 269849668, with filtering turned OFF.) Please keep in mind, however, that compositionally biased regions can cause inaccurate annotation of the query sequence.
If the low complexity filter DID NOT DETECT any compositionally biased regions in the query sequence, then it is displayed as a plain grey bar (with no cyan regions), as shown in the illustrations of the sample concise display and full display of CD-Search results.
Force Live Search - Use this option if your query is a GI or accession number of a protein sequence already in the Entrez Protein database and you prefer to see live rather than precalculated CD-Search results. Note that precalculated searches use default parameters (options), while live searches allow you to change those parameters, if desired. Normally, CD-Search will display precalculated search results for queries that contain a GI or accession number of a sequence already in the Entrez Protein database. This is because CD-Searches are done as part of the automated processing of the Entrez Protein database, and the stored search results are readily available. If that is true for your query, the "BLAST search parameters" information shown at the bottom of a Full Display of search results will say: Data Source: Precalculated Data, and will show the parameters (options) that are used by default. A Live search is done automatically IF: (a) your query is a FASTA formatted sequence, and the FASTA defline does not include a GI or accession number of a sequence record in Entrez protein, or (b) your query includes a GI or accession number but you selected a search database other than the default CDD or you changed any other parameters (options) from their default settings. If a Live Search was done, the "BLAST search parameters" information shown at the bottom of a Full Display of search results will say: Data Source: Live blast search, RID = XXNNNXXNNN. The RID is a "request ID" and will enable you to retrieve the results of that particular search for the next 36 hours. The display will also show the search parameters that were applied.
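The rules above for when CD-Search runs a live search rather than returning precalculated results can be summarized as a small decision function. This is a sketch of the logic as described in this help text, not of the service's implementation; the function and parameter names are hypothetical.

```python
DEFAULT_DATABASE = "CDD"


def uses_live_search(query_has_entrez_id, database=DEFAULT_DATABASE,
                     params_changed=False, force_live=False):
    """Return True if the described rules lead to a live search.

    query_has_entrez_id: the query is (or its FASTA defline contains) a GI or
        accession number of a sequence already in Entrez Protein.
    params_changed: any option other than the database differs from its default.
    """
    if force_live:
        return True  # the user explicitly asked for a live search
    if not query_has_entrez_id:
        return True  # case (a): no stored results exist for this query
    if database != DEFAULT_DATABASE or params_changed:
        return True  # case (b): stored results only cover default settings
    return False     # precalculated results are returned
```

Only a live search is assigned a RID, so a caller could use the return value to decide whether to expect a retrievable request ID.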
Rescue Borderline Hits: This option allows you to see hits that have an E-value above the RPS-BLAST reporting threshold (anywhere between 0.01 and 1.0), and that are consistent with known domain architectures. A rescued hit is displayed with a dashed border , and its e-value is displayed in red , as shown in the illustrated example below. Additional details about rescued hits are provided in: Derbyshire MK, Gonzales NR, Lu S, He J, Marchler GH, Wang Z, Marchler-Bauer A. Improving the consistency of domain annotation within the Conserved Domain Database. Database (Oxford) 2015 Mar 12; 2015. pii: bav012. doi: 10.1093/database/bav012. Print 2015. [PubMed PMID: 25767294] [Full Text at Oxford Journals] [Full Text in PubMedCentral]
Suppress Weak Overlapping Hits: This option suppresses hits that have an e-value close to the RPS-BLAST reporting threshold (in between 0.01 and 0.001) but overlap with stronger hits. A suppressed hit is displayed with a strikethrough , as shown in the illustrated example below. Additional details about suppressed hits are provided in: Derbyshire MK, Gonzales NR, Lu S, He J, Marchler GH, Wang Z, Marchler-Bauer A. Improving the consistency of domain annotation within the Conserved Domain Database. Database (Oxford) 2015 Mar 12; 2015. pii: bav012. doi: 10.1093/database/bav012. Print 2015. [PubMed PMID: 25767294] [Full Text at Oxford Journals] [Full Text in PubMedCentral]
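The two options above, together with the default reporting threshold, can be read as a simple classification of each hit. The sketch below is one possible reading under stated assumptions: the interval boundaries are treated as inclusive, the reporting threshold is fixed at the default of 0.01, and the architecture-consistency and overlap checks are supplied by the caller as booleans. All names are hypothetical.

```python
REPORTING_THRESHOLD = 0.01   # default E-value cutoff
RESCUE_UPPER_BOUND = 1.0     # rescued hits lie between 0.01 and 1.0
SUPPRESS_LOWER_BOUND = 0.001 # weak hits lie between 0.001 and 0.01


def hit_status(evalue, consistent_with_architecture, overlaps_stronger_hit,
               *, rescue, suppress):
    """Classify a hit as 'reported', 'rescued', 'suppressed' or 'not reported'.

    ASSUMPTION: boundary values are inclusive; the help text does not say.
    """
    if evalue > REPORTING_THRESHOLD:
        if rescue and evalue <= RESCUE_UPPER_BOUND and consistent_with_architecture:
            return "rescued"       # shown with a dashed border, E-value in red
        return "not reported"
    if suppress and evalue >= SUPPRESS_LOWER_BOUND and overlaps_stronger_hit:
        return "suppressed"        # shown with a strikethrough
    return "reported"
```

For example, a hit with an E-value of 0.5 that fits a known domain architecture is rescued when the rescue option is on, while a hit at 0.005 that overlaps a stronger hit is suppressed when the suppress option is on.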
Result Mode: allows you to select the level of detail displayed in the search results: Concise mode ( illustrated example ) shows only the best scoring domain model, as available for each region on the query sequence. Standard mode ( illustrated example ) shows the best scoring domain model from each source database, for each region on the query sequence. Full mode ( illustrated example ) shows all hits for each region on the query sequence. Once you are viewing the search results, you can use the "View Full Result/View Concise Result" button in the upper right corner of the search results to toggle between the two views.
Retrieve a previous CD-Search result by entering its Request ID (RID): A Request ID (RID) is assigned to a CD-Search if it was done as a live search. In such a case, the "BLAST search parameters" information shown at the bottom of a Full Display of search results will say: Data Source: Live BLAST search, RID = XXNNNXXNNN. The RID enables you to retrieve the results of that particular search for the next 36 hours by entering that number in the "Retrieve previous CD-Search result" section of the CD-Search page. (RIDs are not assigned to CD-Searches that use precalculated results.)
The CD-Search results page provides the following display options and information for the conserved domains that align to your query sequence: three levels of detail in CD-Search results displays : concise results | standard results | full results types of RPS-BLAST hits : specific | non-specific | superfamily | multidomain display elements : protein classification | domain colors/shapes | jagged edges (partial matches) | double-headed arrows (structural motifs) | small triangles (conserved features/sites) | compositionally biased regions display controls : horizontal zoom | zoom to residue level | refine search | search for similar domain architectures tabular list of domain hits
Three Levels of Detail in CD-Search Results Displays:
A pull-down "View" menu in the upper right corner of a search results page allows you to select the desired view: Concise Results, Standard Results, or Full Results. This enables you to control the level of detail shown in both the Graphical Summary (shown in the illustrations below) and Tabular List of Domain Hits (not shown in the illustrations below for brevity, but available on the actual, interactive CD-Search results page for the example featured in the illustrations).
CD-Search results can include up to four hit types that represent various confidence levels ( specific hits , non-specific hits ) and scope ( superfamilies , multi-domains ) of domain hits. In the search result displays, hits are ranked by E-value , although NCBI-curated models are ranked ahead of other hits to the region if their E-value exceeds a threshold of 1e-05.
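The ranking rule above can be expressed as a two-part sort key. The sketch below assumes that "exceeds a threshold of 1e-05" means the curated model's E-value is at least as significant as 1e-05 (that is, numerically less than or equal to it); the help text does not spell this out, so treat that reading, along with the `Hit` class and function name, as hypothetical.

```python
from dataclasses import dataclass

CURATED_PRIORITY_EVALUE = 1e-05


@dataclass
class Hit:
    name: str
    evalue: float
    ncbi_curated: bool


def rank_hits(hits):
    """Order hits to one query region: promoted curated models, then by E-value."""
    def key(hit):
        # ASSUMPTION: a curated model is promoted when its E-value is <= 1e-05.
        promoted = hit.ncbi_curated and hit.evalue <= CURATED_PRIORITY_EVALUE
        return (0 if promoted else 1, hit.evalue)
    return sorted(hits, key=key)
```

With this key, a curated model at 1e-10 is listed ahead of an imported model at 1e-30, whereas a curated model at 1e-3 is ranked purely by its E-value.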
If the protein query sequence contains compositionally biased regions, those will be detected by the low-complexity filter and shown in the graphical output as cyan regions in the grey bar graphic that represents the query sequence (illustrated example). The filter can be turned ON or OFF (default) when you submit a query by using the options on the search form.
The Concise display is the default output for CD-Search results and shows only the best scoring domain model, as available for each region on the query sequence, in each of three hit types: specific hits, the superfamily to which the highest-ranking hit belongs, and multi-domain models. In addition, small triangles (illustrated example) indicate the amino acids involved in conserved features/sites, such as catalytic and binding sites, when such annotations are available in a domain model.
If CD-Search finds both specific and non-specific hits for a region of a protein query sequence, only the highest ranking specific hit and its superfamily will be shown. If CD-Search finds only non-specific hits for a region of a protein query sequence, only the superfamily to which the hits belong will be shown, but not the non-specific hits themselves. The latter are provided only in the full display .
The top-scoring multi-domain model is shown in the concise display only if: (a) it meets or exceeds the specific hit threshold, OR (b) if it does not overlap with a specific hit or superfamily annotation and if ≥50% of the domain model's length aligns to the query protein sequence. If the top-scoring multi-domain model does not meet the 50% length threshold, it is shown on the concise display only if there is no specific hit or superfamily annotation on that query sequence region at all.
The Standard result lists the best scoring domain model from each source database, as available for each region on the query sequence. In some cases, two NCBI-curated models might be shown for a given region of a protein, if the immediate parent of the highest ranking NCBI-curated conserved domain model is also in the search results. (A separate section of this document provides more information about domain family hierarchies.) The top-scoring multi-domain model from each source database is also shown.
A separate section of this help document provides more information about the small triangles that represent conserved features/sites, and the color/shape combinations used for the domain cartoons.
In addition, small triangles (illustrated example) indicate the amino acids involved in conserved features/sites, such as catalytic and binding sites, when such annotations are available in a domain model.
As an example , open the current, interactive CD-Search: Standard results page for protein GI 157830769 .
The Full display shows all domain models, as available for each region on the query sequence, that meet or exceed the RPS-BLAST threshold for statistical significance (i.e., the E-value cutoff). The hit types can include specific hits, non-specific hits, the superfamily(ies) to which those hits belong, and multi-domain models. Hits are ranked by E-value, although NCBI-curated models are ranked ahead of other hits to the region if their E-value exceeds a threshold of 1e-05. In addition, small triangles indicate the amino acids involved in conserved features/sites, such as catalytic and binding sites, when such annotations are available in a domain model.
The bottom of the Full Display ( not shown in the image below but viewable by clicking on that image to open the actual, interactive CD-Search results page ) also provides a summary of BLAST search parameters , which includes information such as the database which you searched against , whether the low complexity filter was used, the expect value (E-value) threshold , the BLAST software version number, and whether RPS-BLAST did a live search or retrieved precalculated search results. If a live search was done, the BLAST Request ID (RID) is also shown in the "BLAST search parameters" section and allows you to retrieve the search results by RID anytime within 36 hours following the search, without having to re-execute it. (Note: Only the top portion of the full display is shown in the image below, illustrating the components of the graphical summary. To see the complete display, including the List of Domain Hits and BLAST search parameters, click on the image below in order to open the actual, interactive CD-Search results page.)
Types of RPS-BLAST hits: CD-Search results can include hit types that represent various confidence levels (specific hits, non-specific hits) and domain model scope (superfamilies, multi-domains). They can be seen in both the Concise display and Full display , except for non-specific hits, which are shown only in the Full Display.
Specific hit is the top-ranking RPS-BLAST hit (compared to other hits in overlapping intervals) that meets or exceeds a domain-specific E-value threshold (details and illustration) . It represents a very high confidence that the query sequence belongs to the same protein family as the sequences used to create the domain model, and therefore a high confidence level for the inferred function of the protein query sequence. Non-specific hits meet or exceed the RPS-BLAST threshold for statistical significance (default E-value cutoff of 0.01, or an E-value selected by the user with advanced search options ). (NOTE: Non-specific hits are shown only in the full display (illustration) of search results. In contrast, the concise display (illustration) shows only the superfamily to which the top-scoring non-specific hit for a given sequence region belongs.) Superfamily is the domain cluster to which the specific and/or non-specific hits belong. This is a set of conserved domain models that generate overlapping annotation on the same protein sequences and are assumed to represent evolutionarily related domains. (See additional details, including information about clustering methodology, under " What is a superfamily? ") In the Concise Display , if a region of the query sequence has only non-specific hits to domain models from a given superfamily, only the superfamily footprint will be displayed -- not the individual superfamily members to which the query sequence had non-specific hits. To see the latter, view the Full Display of search results. In that display, the width of the box that encloses superfamily members is determined by the alignment span of the highest scoring superfamily member. Multi-domains are domain models that were computationally detected and are likely to contain multiple single domains. They are typically shown as grey -colored bars. (Examples are shown in the concise display and full display illustrations.)
A number of display elements are used to graphically convey conserved domain annotations on the query sequence. Those elements are used in all three views of search results: Concise Results, Standard Results, or Full Results. The display elements, described below, include: protein classification, domain colors/shapes, jagged edges (partial matches), double-headed arrows (structural motifs), small triangles (conserved features/sites), and compositionally biased regions.
The Tabular List of Domain Hits, which appears beneath the graphical summary of search results, provides additional details about, and viewing options for, each conserved domain model that has been mapped to your query sequence. Each domain model's accession number links to the corresponding record in the Conserved Domain Database, where the model and its member sequences can be launched in free software such as the Cn3D structure viewing program, or the CDTree alignment viewer/editor.
Protein Classification: How is the protein classification determined? A protein classification is shown in the CD-Search results, when possible, and provides a functional characterization of the conserved domain architecture found in the protein query (illustrated example). The protein classification section appears in CD-Search results only if we have a curated label for the protein family to which the query sequence belongs. A domain architecture is defined as the sequential order of conserved domains in a protein, and the architectures are computationally identified by the Conserved Domain Architecture Retrieval Tool (CDART). A tool called SPARCLE (Subfamily Protein Architecture Labeling Engine) is used to label the proteins that contain a given architecture. Each conserved domain architecture can be assigned a unique, functional name based on the composition of the architecture. The names are assigned either manually, through a curation process, or computationally, by an autoname algorithm or a namedbydomain algorithm. The domains used for architectures may include ancient superfamilies (like ATPase) or much more recently evolved protein subfamilies (like RAS). In the case of curated architectures, the functional characterization of each architecture is written by Conserved Domain Database Curators, based on a review of the publications associated with the proteins that contain the domain architecture. To give an example of proteins that have similar function but different domain architectures: DNA gyrase B (NP_387887), an antibiotic target, has a conserved domain architecture that includes a histidine kinase-like ATPase domain, a transducer domain, a topoisomerase-primase domain, followed by a type II topoisomerase carboxy domain. In contrast, enzymes of similar function, such as topoisomerase IV (Q45066), have a different domain architecture.
Note: in each of the examples above, the default graphic that appears when you click on the architecture link depicts the full length protein model; click on the "View: Full Results" option in the upper right hand corner of the display to see the individual conserved domains that compose the full length protein model. Also, in each example, you can follow the "domain architecture ID xxxxxxx" link that appears in the "Protein Classification" section of the display to open the corresponding SPARCLE record. The SPARCLE record, in turn, lists the evidence that was used to name the architecture and contains links to other protein sequences that have the same architecture.
There are several types of architectures:
superfamily architectures - domain architectures consisting solely of superfamilies.
subfamily architectures - domain architectures that mix superfamilies and subfamilies (i.e., conserved domain models that get a specific hit to the protein query sequence).
Note: It is also possible for a domain architecture to consist of a single conserved domain footprint.
Separate sections of this help document provide additional information about domain family hierarchies, and the hit types you see in CD-Search results, such as specific hits, non-specific hits, the superfamily(ies) to which those hits belong, and multi-domain models. Each superfamily is represented by a cartoon with a distinct color/shape combination, in order to distinguish domains from each other. The SPARCLE Help document provides additional information about that resource, including an overview, examples of how SPARCLE can be used to learn more about proteins, allowable input, a description of the search output and the contents of a SPARCLE record, and details about the data processing pipeline.
Jagged Edges: What do domain cartoons with jagged edges mean? Occasionally domain-cartoons have jagged edges ( illustrated example ). This means that the alignment found by RPS-BLAST omitted more than 20% of the CD's extent at the n- or c-terminus (or both, as indicated by the cartoons). This feature may give hints towards truncated query sequences, false-positive hits, or unusual domain architectures involving long insertions. The exact percentage of the CD's extent used in the alignment is listed in detail in the pairwise alignment section.
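The jagged-edge rule can be checked from the alignment coordinates on the domain model. The sketch below assumes 1-based, inclusive model positions and applies the 20% limit to each terminus separately; both are assumptions about how the rule is applied, and the function name is hypothetical.

```python
def jagged_edges(model_length, aligned_from, aligned_to, max_omitted=0.20):
    """Return (n_terminal_jagged, c_terminal_jagged) for one domain hit.

    aligned_from / aligned_to: first and last model positions covered by the
    alignment (ASSUMPTION: 1-based, inclusive).
    """
    missing_n = (aligned_from - 1) / model_length         # fraction cut at N-terminus
    missing_c = (model_length - aligned_to) / model_length  # fraction cut at C-terminus
    return missing_n > max_omitted, missing_c > max_omitted
```

For a model of length 100 aligned from position 30 to 100, 29% of the N-terminal extent is missing, so only the N-terminal edge would be drawn jagged.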
Double-headed Arrows (structural motifs) What do the double-headed arrows mean in the Graphical Summary? Double-headed arrows appear on a CD-Search results graphical summary only if the query protein contains structural motifs . Structural motifs are regions in proteins and protein domains that are too small to be modelled as individual evolutionarily conserved domains and too extensive to be characterized as conserved features/sites . They play a structural and/or functional role that CDD curators chose to document, as their presence contributes to functional annotation and/or protein classification. Structural motifs are particularly useful in annotating the locations of specific repeats. Examples are blades in beta-propeller structures ("closed solenoid proteins"), super-helical repeats such as Armadillo and HEAT ("open solenoid proteins"), various zinc fingers, various calcium-binding motifs, coiled coils, or transmembrane segments. The structural motifs cannot be modeled as evolutionarily conserved domains because the properties of the PSSMs as search models require a minimum length to be effective (exacerbated by the fact that many of the structural motif regions have compositional bias ), and because the evolutionary history of most of these structural motifs is not clear enough to enable a representation of that history.
Small Triangles What do the small triangles mean in the Graphical Summary? The small triangles beneath the query protein on a CD-Search results page indicate the residues that comprise conserved features/sites , such as binding or catalytic sites, as mapped from the conserved domain annotations to the query sequence. An illustrated example is below. The triangles appear if a region of the query protein sequence either: gets a specific hit to an NCBI-curated domain domain model on which conserved features/sites have been annotated. In such a case, the conserved features/sites that have been annotated on the domain model will be mapped to the query sequence. OR gets a specific hit to a domain model from an external source that belongs to a superfamily which also contains NCBI-curated hits that align to the query sequence. In such a case, the conserved features/sites from the superfamily representative will be mapped to the query sequence. (Technically, they will be mapped from the superfamily representative to the best-scoring NCBI-curated domain that is a non-specific hit, and then from that hit to the protein query sequence.) OR gets a non-specific hit to an NCBI-curated domain model that belongs to a superfamily whose representative has conserved/feature site annotations. In such a case, the conserved features/sites from the superfamily representative will be mapped to the query sequence. (Note that the non-specific hit will not appear on the concise display of the CD-Search results -- only the site annotations will appear there. View the full display to see both the triangles and the hit.) The triangles are shown in the same color as the domain on which they have been annotated. 
Click on the triangles to view details about the feature, including a multiple sequence alignment of your query sequence and the protein sequences used to curate the domain model, where hash marks (#) above the aligned sequences (illustration) show the location of the conserved feature residues. A thumbnail image, if present, provides an approximate view of the feature's location in three dimensions and options for interactive 3D structure viewing. Conserved features/sites, if present, are shown by default in the graphical display. If desired, they can be hidden by clicking on show options in the graphical summary header bar, then deactivating the show site features checkbox and pressing the update button.
Compositionally Biased Regions: On the CD-Search results page, what do the cyan regions mean in the bar graphic that represents the query sequence? These represent compositionally biased regions detected in the query sequence by the low-complexity filter. If the low-complexity filter was ON for the search, the compositionally biased regions were NOT USED in the search against the domain database and are shown as SOLID cyan blocks. (As an example, open the default CD-Search results for P14780, GI 269849668, with filtering turned ON.) However, those regions may still overlap with or be included in a domain footprint and the pairwise alignment generated by RPS-BLAST. If the low-complexity filter was turned OFF (default) for the search, the compositionally biased regions were USED in the search and are shown as blocks OUTLINED in cyan. (As an example, open the CD-Search results for P14780, GI 269849668, with filtering turned OFF.) Although compositionally biased regions can cause inaccurate annotation of the query sequence, their effect is ameliorated to a great extent by composition-corrected scoring, which is turned on by default. If the low-complexity filter DID NOT DETECT any compositionally biased regions in the query sequence, then the sequence is displayed as a plain grey bar (with no cyan regions), as shown in the illustrations of the sample concise display and full display of CD-Search results.
The CD-Search results display can be customized with the following controls: horizontal zoom, zoom to residue level, refine search, and search for similar domain architectures. The Tabular List of Domain Hits, which appears beneath the graphical summary of search results, provides additional details about, and viewing options for, each conserved domain model that has been mapped to your query sequence. Each domain model's accession number links to the corresponding record in the Conserved Domain Database, where the model and its member sequences can be launched in free software such as the Cn3D structure viewing program, or the CDTree alignment viewer/editor.
Horizontal Zoom If a query sequence is very long and contains many domains (e.g., human titin isoform N2-B, gi 291045223 ), the details of the graphical summary might be difficult to read. In that case, you can click on show options in the graphical summary header bar, enter the desired magnification level in the horizontal zoom box, and press the update button to refresh the display. There is no specific maximum value that can be entered in the horizontal zoom box. Rather, the limit is determined by the pixel width of the graphic image displayed. If the zoom value you enter is too large, the system will display the message: "invalid zoom factor". In that case, enter a smaller zoom value. There might be other cases in which the zoom value is acceptable but it takes some time to generate the display. In such cases, you might get an option to stop script or continue . Choose the latter if you would like the process of generating the enlarged graphic display to continue. Zoom to Residue Level This option displays the amino acids ("residues") in the query sequence . It also highlights the amino acids in the query that are mapped to conserved features/sites , which are denoted by small triangles in the graphical summary. As an example of the "zoom to residue level" view, see the human regulator of G-protein signaling 12 isoform 2 . When you activate the "zoom to residue level" setting, the " horizontal zoom " text box (which is visible when you press "show extra options") will still retain the zoom value that was used before the "zoom to residue level" option was activated. This makes it possible to easily toggle between the residue level view and the previous zoom level. Note: The "show extra options/horizontal zoom" text box will generally contain the default value of 1, unless you viewed a different magnification before zooming in to the residue level. 
The actual horizontal zoom level that is applied by the CD-Search program when the "zoom to residue level" option is checked varies based on the length of the sequence and is determined automatically by the program. Note about display limits: If a protein sequence is very long, the individual amino acids might not be visible when the "zoom to residue level" option is checked. This is because images wider than approximately 35000 pixels cannot be displayed correctly in browsers. Therefore, the "zoom to residue level" option limits the display to 35000 pixels wide. If the query sequence is very long (as an example, see human titin isoform N2-B, 26,926 amino acids long), the program will still draw the residues, but they will be squeezed together and will not be easily readable as individual letters.
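The arithmetic behind the display limit can be sketched as follows. This is only an illustration, not the CD-Search program's actual logic: the 35000-pixel cap and the titin length come from the text above, while the function names and the minimum readable letter width (`min_letter_width_px`) are assumptions chosen for the example.

```python
# Illustrative sketch only; not the CD-Search implementation.
MAX_IMAGE_WIDTH_PX = 35000  # display limit stated in the help text


def pixels_per_residue(sequence_length: int,
                       max_width_px: int = MAX_IMAGE_WIDTH_PX) -> float:
    """Horizontal space available to each residue when the image is capped."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    return max_width_px / sequence_length


def residues_readable(sequence_length: int,
                      min_letter_width_px: float = 6.0) -> bool:
    """True if each residue letter gets at least the assumed minimum width.

    min_letter_width_px is a hypothetical value picked for illustration.
    """
    return pixels_per_residue(sequence_length) >= min_letter_width_px


if __name__ == "__main__":
    titin_length = 26926  # human titin isoform N2-B, per the text
    print(f"{pixels_per_residue(titin_length):.2f} px per residue")
    print("readable:", residues_readable(titin_length))
```

With roughly 1.3 pixels available per residue, the letters of a titin-length query cannot be drawn legibly, which matches the "squeezed together" behavior described above.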
Refine Search The Refine Search button on a CD-Search results page allows you to modify your query to search against a different database and/or use advanced search options .
Search for Similar Domain Architectures The Search for Similar Domain Architectures button on a CD-Search results page retrieves proteins that contain one or more of the domains present in the query sequence, using the Conserved Domain Architecture Retrieval Tool, "CDART" ( illustrated example ).
Tabular List of Domain Hits Beneath the CD-Search results graphical summary is a Tabular List of Domain Hits. This table appears in all three views: Concise Display, Standard Display, and Full Display. When you mouse over any conserved domain footprint in the graphical summary, the corresponding CD accession number and description will be highlighted in the tabular list of domain hits. If a domain model aligns to more than one region of the query sequence, it will be listed multiple times in the tabular list of domain hits. This is because the alignment coordinates and score of the domain model vary among different regions of the query sequence, and each hit is reported separately. (As an example, see the CD-Search results page for protein GI 157830769, Cyclodextrin Glucanotransferase, which is the query sequence featured in the illustrations of the concise results, standard results, and full results.) Click on the [+] to the left of the CD accession to see a pairwise alignment of your query sequence and the consensus sequence for the domain model. (Residues that are identical between your query sequence and the consensus sequence are shown in red.) Click on the CD accession number to view the domain model's summary record in the Conserved Domain Database (CDD). If you'd like to see your query sequence embedded in the domain model's multiple sequence alignment, click on the domain footprint for any specific hit, non-specific hit, or multidomain of interest in the graphical portion of the concise or full CD-Search results page. That will open the Entrez CDD record for the domain model, with your query sequence embedded in the model's multiple sequence alignment. In that view, you can change the color bits setting to increase or decrease the threshold that determines which columns of the alignment are displayed in red.
(Note: Superfamily records do not include a multiple sequence alignment display, so if you click on the footprint of any superfamily, you will see a CDD summary page that provides general information about the superfamily and lists the domain models that belong to it. Only the individual domain models will have multiple sequence alignments, and you must click on the footprints of those models in the graphical summary of a CD-Search results page (not on the superfamily's CDD summary page) in order to see your query sequence embedded in the alignment.) The tabular list of domain hits also provides a link from each domain model's accession number to the corresponding record in the Conserved Domain Database, where the model and its member sequences can be launched in free software such as the Cn3D structure viewing program, or the CDTree alignment viewer/editor.
A specific hit is a high confidence association between a protein query sequence and a conserved domain, resulting in a high confidence level for the inferred function of the protein query sequence. It is one of four types of RPS-BLAST Hits . (See illustrations of CD-Search results concise display and full display for examples.)
In order to be considered a specific hit , an alignment of a domain model to a query protein sequence must meet two criteria :
The domain model must be either: (1) the top-ranked (best E-value) NCBI-curated domain, or (2) the top-ranked domain model from an external source, if there is no NCBI-curated domain that meets all the criteria for a specific hit. If domain models from both the NCBI-curated data set and external sources meet a domain-specific threshold, the NCBI-curated domain will be listed preferentially as the specific hit because it has been annotated with fine-grained evolutionary relationships, conserved sequence blocks, specific functions, and conserved features/sites based on careful review of sequence data, 3D structures, and literature. However, if no NCBI-curated domain meets the criteria for a specific hit, then the top-ranked domain model from an external source will be shown in the CD-Search results concise display if it meets all the criteria for a specific hit.
The E-value of the RPS-BLAST hit must be equal to or lower than a domain-specific threshold E-value. The domain-specific threshold is the weakest E-value obtained when each of the protein sequences used to curate a domain is RPS-BLAST'ed against that domain's Position-Specific Scoring Matrix (PSSM). In other words, the threshold is the weakest E-value among self-hits of a domain's member protein sequences to the resulting domain model. The illustration below provides an example, showing the domain-specific threshold for cd03683, ClC-1-like chloride channel proteins. Domain-specific threshold scores are displayed (in the form of bit score) in the statistics box of a domain model's CD-summary page.
If a specific hit IS found on a protein sequence, then: There is a high confidence level that the query protein sequence is a member of the protein family represented by the domain model and has the specific function annotated on that domain. If the query sequence resides in the Entrez Protein database, the inferred function is annotated as "region" on the protein sequence record, showing the name of the high-scoring domain model and its base span. If the specific hit is to an NCBI-curated domain model that includes conserved features (residues involved in catalysis or binding), those are annotated on the protein sequence record as "sites." If the specific hit is to a domain model from an external source , and the model belongs to a superfamily whose representative is an NCBI-curated domain that has such annotations, then the conserved features/sites that have been annotated on the superfamily representative will be mapped to the query sequence.
If a specific hit IS NOT found on a query protein sequence, but the protein has an otherwise statistically significant hit (E-value cutoff of 0.01) to any domain model in CDD , the domain model is regarded as a non-specific hit . In that case: The general function of the domain superfamily can be inferred for the query protein sequence, but the specific function is less certain. If the query protein sequence resides in the Entrez Protein database, the name and general function of the domain superfamily is annotated in the protein sequence record (as a "region" ). The name and function text is derived from the domain model which has been selected as the superfamily representative . Conserved features ( "sites" ) are also annotated on the protein sequence record if the superfamily representative is an NCBI-curated domain that has such annotations.
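The rules above can be summarized in a short sketch. It is a simplified illustration under stated assumptions, not the CD-Search implementation: the function names are hypothetical, only the specific and non-specific hit types are modeled (the other hit types are left out), the 0.01 cutoff is treated as inclusive, and E-values are used throughout even though the actual threshold calculation uses bit scores.

```python
# Simplified sketch of the specific / non-specific hit rules described above.
# All names are hypothetical; this is not NCBI code.
from typing import List

SIGNIFICANCE_CUTOFF = 0.01  # E-value cutoff for a significant hit, per the text


def domain_specific_threshold(self_hit_evalues: List[float]) -> float:
    """Weakest (largest) E-value among the member sequences' self-hits."""
    if not self_hit_evalues:
        raise ValueError("at least one self-hit E-value is required")
    return max(self_hit_evalues)


def classify_hit(evalue: float, threshold_evalue: float,
                 is_top_ranked: bool) -> str:
    """Classify one hit of a query region to a domain model.

    is_top_ranked: the model is the best-E-value eligible model for the
    region (NCBI-curated preferred, external source otherwise).
    """
    if is_top_ranked and evalue <= threshold_evalue:
        return "specific"
    if evalue <= SIGNIFICANCE_CUTOFF:  # inclusive comparison is an assumption
        return "non-specific"
    return "not significant"


if __name__ == "__main__":
    threshold = domain_specific_threshold([1e-50, 1e-30, 1e-42])
    print(classify_hit(1e-35, threshold, is_top_ranked=True))
    print(classify_hit(1e-10, threshold, is_top_ranked=True))
    print(classify_hit(0.5, threshold, is_top_ranked=True))
```

The key design point is that the threshold is per domain model: a hit that is statistically significant overall can still fall short of the weakest self-hit of that model's own member sequences, in which case it is reported as non-specific.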
Click anywhere on the image to open the complete, interactive record for this domain model (cd00400) in the Conserved Domain Database (CDD). * Domain-specific threshold scores are displayed, in the form of bit scores, in the statistics box of a domain model's CD summary page. In the actual calculation of domain-specific thresholds, bit scores are used rather than E-values. (A bit score is defined in the NCBI Handbook glossary, the BLAST glossary, and the Field Guide glossary.) NOTE: The image above reflects the cd03683 domain alignment as of April 20, 2008. The scientific community's understanding of molecular data continues to evolve as research progresses, and as new as well as updated sequence data are regularly deposited into the databases. If a member sequence used in a domain alignment is later superseded by an updated version, the new sequence data and gi number will replace the old one during review/update cycles of curated domains. Some revisions to sequence data, such as upstream or downstream extensions, do not affect the domain model, but the gi number and amino acid span will change to reflect the updated sequence record.
When you click on the cartoon (colored bar representing a domain footprint) in the graphical display on the CD-search results page, an alignment view will be opened, which adds the query sequence to the multiple CD-alignment. It is possible to modify the number and type of sequences shown, as described in the help document section on CDD Record : multiple sequence alignment displays Display options .
If you display an alignment view that includes a query sequence, you can also view the same alignment in the Cn3D program by pressing the Structure View button. (Cn3D installation takes only a couple of minutes and a tutorial describes the program's features and functions. The program must be installed in order for the Structure View button to work.) If a protein sequence from a 3D structure is included among the sequences used to curate a domain model, Cn3D will show the 3D structure as well. If the domain model includes sequences from more than one 3D structure, all of the structures will be displayed, superimposed upon each other, and their sequences will be displayed in the multiple sequence alignment. Cn3D offers column-specific coloring by sequence conservation when invoked with multiple alignment views. This is a convenient feature to study sequence conservation within a CD-alignment and to find out how well the aligned query fits the existing patterns of conservation and variability.
CD-Search requests are submitted to the BLAST servers immediately. A typical search should take only a few seconds, depending on the size of the search database chosen, the length of the query sequence, and the load on the servers. Click here to test response time with a typical query. CD-Search requests can also be sent to the BLAST queuing system (this happens by default for searches launched in parallel with protein BLAST requests); to do so, use the optional button at the bottom of the CD-Search page. Requests sent to the queue will take longer, but the results can be retrieved at a later time using the RID ("Request ID"), without having to re-calculate the search. A form at the bottom of the CD-Search page can be used to retrieve earlier search results by RID.
When CD-search is run as an integral part of protein-BLAST search requests, the jobs are put in the BLAST queue and may take a little longer to complete (depending on the system load and length of query sequence). Queued CD-search will try to retrieve the finished results every few seconds until they are available. You may also store the request-id (RID) and retrieve results later here .
Yes, you can run RPS-BLAST locally. A standalone version of RPS-BLAST is packaged with the BLAST executables available on the NCBI FTP site, and is also available as part of the NCBI toolkit distribution (see ftp://ftp.ncbi.nih.gov/toolbox ). Separate directories on the FTP site provide documents that describe each of the BLAST applications, including documents for RPS-BLAST and a Formatrpsdb application that can be used to build search databases that are properly formatted for use with RPS-BLAST. Pre-formatted search databases, which have already been processed by Formatrpsdb, are available on the CDD FTP site. A README file on the CDD FTP site also provides more details about customizing search databases.
There are several differences between the CD-Search web service and standalone RPS-BLAST, as distributed by NCBI and used with search databases as distributed by the CDD group. The web server is optimized for the most common use of the CDD resource, which is to annotate protein sequences with clearly identified and well understood protein domains, and is also optimized for speed in order to accommodate a high volume of searches. As part of the optimization, we use some different statistical parameters for the web service than for the standalone RPS-BLAST application. Specifically, we use a constant, assumed search "database size" setting on the web server for calculating E-values. This means that the actual size of the search database can change (we are adding new models every few weeks), but the E-value computed for any individual GI -- PSSM match will remain constant. This approach: (a) ensures that pre-calculated results are not dependent on the actual size of the model collection (which is redundant and mostly grows by increasing that redundancy); (b) facilitates incremental updates of pre-computed sequence annotation with conserved domains; and (c) is used for the creation of protein-CDD links. In contrast, standalone RPS-BLAST does not employ the constant, assumed database size parameter. So when you use a search set downloaded from the CDD FTP site, the database size might be different from the one used by the CD-Search web service, and the same hit of your query protein to a model will receive a different E-value in the standalone result. For example, if the size of the FTP'ed database is smaller than what the CD-Search web service assumes in its database size parameter, the same hit of your query protein to a model will receive a lower E-value in the standalone.
Conversely, if the size of the FTP'ed database is larger than what the CD-Search web service assumes in its database size parameter, the same hit of your query protein to a conserved domain model will receive a higher E-value in the standalone. If you want standalone RPS-BLAST to use the same database size parameter that is used for the web server (and thereby reproduce the same E-values with standalone RPS-BLAST that are generated by the web service), you can do that by creating an "alias" file on your local computer and placing it in the same directory as the standalone RPS-BLAST executable. The file can have a name such as "mycdd.pal" and can have contents such as the following (where lines starting with "#" are comments):

    #
    # RPSBLAST alias file
    #
    TITLE mycdd
    #
    DBLIST ./Cdd
    #
    STATS_TOTLEN 13521388
    STATS_NSEQ 59695

This will now let you search against the database named "Cdd" using the two search set size parameters as specified, e.g.:

    ~$ rpsblast -i rpstest.tfa -d mycdd -F T -e 0.01 -m 9
    # RPSBLAST 2.2.26 [Sep-21-2011]
    # Query: gi|156356500|ref|XP_001623960.1| predicted protein [Nematostella vectensis]
    # Database: mycdd
    # Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
    gi|156356500|ref|XP_001623960.1| gnl|CDD|197660 31.91 47 29 2 432 475 4 50 7e-04 36.9
    gi|156356500|ref|XP_001623960.1| gnl|CDD|197660 31.48 54 31 3 493 545 6 54 8e-04 36.5
    gi|156356500|ref|XP_001623960.1| gnl|CDD|197660 33.33 42 27 1 312 352 2 43 0.003 35.3
    gi|156356500|ref|XP_001623960.1| gnl|CDD|119391 23.53 51 34 2 493 542 1 47 8e-04 36.4
    gi|156356500|ref|XP_001623960.1| gnl|CDD|119391 21.57 51 35 2 375 424 1 47 0.003 34.5
    gi|156356500|ref|XP_001623960.1| gnl|CDD|177721 24.47 94 56 3 463 541 18 111 0.005 38.6

In addition to the different statistical parameters, the CD-Search web service does not filter out compositionally biased regions in the query sequence by default.
It uses composition-corrected scoring to mitigate the effects of compositional bias. In contrast, standalone RPS-BLAST filters out compositionally biased segments and does not employ composition-corrected scoring. In the current RPS-BLAST version 2.13.0 (as of Oct. 2022), you can set parameters to replicate CD-Search settings by specifying " -comp_based_stats 1 " and " -seg no " on the command line. If those options are not specified, standalone RPS-BLAST may retrieve somewhat different results. Finally, some advanced options in standalone RPS-BLAST are not available in the web service, such as the ability to use a single-hit/two-pass mode in order to detect more distant homologous relationships . Users who select such options in the standalone version may get different search results with the web service.
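To see why a different database size changes the E-value of the same hit, the following sketch uses the textbook relation between bit score and E-value, E = m * n * 2^(-S'), where m is the query length, n is the database length, and S' is the bit score. This is an approximation for illustration only: BLAST applies effective-length corrections that are ignored here, so the numbers are not expected to reproduce the E-values in the sample output above. The function names and the query length of 500 are assumptions; the database length 13521388 is the STATS_TOTLEN value from the alias file example.

```python
# Illustration of how the E-value of a fixed-score hit scales with database size.
# Uses the textbook relation E = m * n * 2**(-S'); BLAST's effective-length
# corrections are ignored, so this is a sketch, not a reimplementation.

def evalue_from_bit_score(bit_score: float, query_length: int,
                          database_length: int) -> float:
    """Approximate E-value for a hit with the given bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


def rescale_evalue(evalue: float, actual_db_length: int,
                   assumed_db_length: int) -> float:
    """Convert an E-value computed against one database size to another.

    Under the relation above the E-value is proportional to database length.
    """
    return evalue * assumed_db_length / actual_db_length


if __name__ == "__main__":
    assumed_length = 13521388   # STATS_TOTLEN from the alias file example
    local_length = 10_000_000   # hypothetical smaller downloaded database
    query_length = 500          # hypothetical query length

    local_e = evalue_from_bit_score(36.9, query_length, local_length)
    web_like_e = rescale_evalue(local_e, local_length, assumed_length)
    print(f"E-value against smaller local database: {local_e:.3g}")
    print(f"E-value under the assumed database size: {web_like_e:.3g}")
```

Because the E-value is proportional to the database length in this relation, a smaller local database yields a lower E-value for the same bit score, and a larger one yields a higher E-value, which is the behavior described above. Fixing the size in an alias file removes that dependence.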
A 72-year-old movie has one of the best explanations for time travel.
Every time travel movie has its own set of rules, but one underrated movie from over seven decades ago might have the best explanation of them all.
- I'll Never Forget You provides a unique and realistic approach to time travel, surpassing many modern movies in its storytelling and execution.
- The movie's explanation of time travel, using the analogy of stars, is one of the best and most understandable concepts in the genre.
- The film's focus on the moral and ethical implications of time travel sets it apart, making it an underrated gem in the genre.
Every time travel movie has its own set of rules for the science fiction phenomenon, but one movie from over 70 years ago, I'll Never Forget You , has an explanation that trumps that of modern movies. Arriving nearly 35 years before time travel classic Back to the Future , the 1951 film I’ll Never Forget You follows scientist Peter Standish, whose decision to travel from the 1950s to the 18th century doesn’t quite work out as he had hoped. Rather than leaving behind the complexities of 20th century life for a simpler existence, Peter has a difficult time assimilating to the past. To make matters worse, Peter falls for the wrong woman, jeopardizing both his past and future.
Considering the concept of time travel wouldn’t become mainstream in movies until almost a decade later with The Time Machine in 1960, I’ll Never Forget You ’s love story is way ahead of its time. I’ll Never Forget You ’s tale is fully grounded in reality, which allows it to be easily accessible, shining a light on the moral and ethical implications of Peter’s journey in a way that modern movies often don’t. Not only is I’ll Never Forget You an underrated yet great time travel movie , but its distinct approach to and explanation for time travel remains one of the best.
I'll Never Forget You Perfectly Explains The Time Travel Concept
I’ll Never Forget You’s explanation of the concept of time travel is the one that makes the most sense. According to Peter Standish, the past is still happening around those in the present, and he uses the analogy of stars to convey his point. When people look at stars, they are often seeing them as they were long ago rather than how they are in the present, due to the time it takes for light to travel. Peter claims that the past is like the stars: those in the present can see it, but can’t reach it yet.
While all time travel movies offer their own rules for the concept of time travel, they often follow conventional standards for how time works. Most time travel movies operate under the assumption that time is linear or nonlinear , but I’ll Never Forget You ’s star analogy is easily one of the best explanations for the concept of time travel. It perfectly describes how the “ past ” can become the new “ present ,” and how time continues to pass in the “ present ”. I’ll Never Forget You ’s “ simultaneous ” concept of time is completely unique, which cements it as one of the most underrated time travel movies.
I'll Never Forget You's Time Travel Story Is Better Than Most Modern Movies
Though I’ll Never Forget You’s story is straightforward, it manages to outdo those of many modern time travel movies. Other time travel movies can get bogged down in trying to explain their concepts with science, which can sometimes lead to the plot becoming overly convoluted. However, I’ll Never Forget You’s comparatively simple approach allows the movie to fully invest itself in the issues surrounding Peter’s flawed actions in the 18th century. The morality and ethics of time travel are a core tenet of all great movies in this genre, and I’ll Never Forget You’s ability to focus on these questions contributes to its successful execution.
Part of I’ll Never Forget You ’s plot that puts it ahead of modern movies is its realistic approach . Because Peter keeps slipping up and can’t fit in, his 18th century peers deem him "insane" and take him into custody. While it’s a dismal outcome, it makes sense that someone from modern times wouldn’t be able to assimilate into a previous century, something time travel movies often get wrong . While I’ll Never Forget You is rarely considered in conversations about the greatest time travel movies, its unique yet realistic approach to the subject proves it should be.