Chapter 3 Data transformation

We wrote three functions specifically to scrape the Wikipedia tables. With them, we collected the Oscar winners and nominees for each year (1927-2021) in four categories: Best Actor, Best Actress, Best Supporting Actor, and Best Supporting Actress. The results are stored in the data frames actor_lead, actress_lead, actor_sup, and actress_sup, respectively.

Wikipedia lists each year in a separate table, so we combined the rows of the yearly tables into one data frame per category.
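
The row-combining step can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the scraping code itself: `yearly_tables` and `actor_lead_demo` are hypothetical names, and the two small tables are toy stand-ins for the per-year tables scraped from Wikipedia.

```r
# Hypothetical per-year tables (toy stand-ins for the scraped Wikipedia tables).
# The list names carry the year of each table.
yearly_tables <- list(
  "1928" = data.frame(Actor = c("Emil Jannings", "Richard Barthelmess"),
                      Film  = c("The Last Command", "The Noose"),
                      stringsAsFactors = FALSE),
  "1929" = data.frame(Actor = "Warner Baxter",
                      Film  = "In Old Arizona",
                      stringsAsFactors = FALSE)
)

# Attach the year to every row of each table.
with_year <- Map(
  function(tbl, yr) cbind(Year = as.numeric(yr), tbl),
  yearly_tables, names(yearly_tables)
)

# Stack all yearly tables row-wise into one data frame for the category.
actor_lead_demo <- do.call(rbind, with_year)
rownames(actor_lead_demo) <- NULL  # drop the "1928.1"-style row names

print(actor_lead_demo)
```

Keeping the year in the list names and attaching it before stacking means every row remains traceable to its source table after the tables are merged.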

We used the same method for Black, Latin, and Asian lead and supporting actors and actresses, storing the results in the data frames black_actor_lead, black_actor_sup, black_actress_lead, and black_actress_sup, with corresponding data frames for the Latin and Asian groups. We also found one Egyptian Best Actor winner and stored that record in a single-row data frame.

3.1 Black Lead Actors and Actresses

## # A tibble: 3 x 6
##   Year  Name             Film                 Role       Status `Milestone / N~`
##   <chr> <chr>            <chr>                <chr>      <chr>  <chr>           
## 1 1958  Sidney Poitier   The Defiant Ones     Noah Cull~ Nomin~ "First Black ac~
## 2 1963  Sidney Poitier   Lilies of the Field  Homer Smi~ Won    "First Black ma~
## 3 1970  James Earl Jones The Great White Hope Jack Jeff~ Nomin~ ""
## # A tibble: 3 x 6
##   Year  Name              Film                 Role      Status `Milestone / N~`
##   <chr> <chr>             <chr>                <chr>     <chr>  <chr>           
## 1 1954  Dorothy Dandridge Carmen Jones         Carmen J~ Nomin~ First African-A~
## 2 1972  Diana Ross        Lady Sings the Blues Billie H~ Nomin~ First African-A~
## 3 1972  Cicely Tyson      Sounder              Rebecca ~ Nomin~ First film to f~

3.2 Latin Lead Actors and Actresses

## # A tibble: 3 x 4
##   Year  Nominee       Film               Result   
##   <chr> <chr>         <chr>              <chr>    
## 1 1950  José Ferrer   Cyrano de Bergerac Won      
## 2 1952  José Ferrer   Moulin Rouge       Nominated
## 3 1957  Anthony Quinn Wild Is the Wind   Nominated
## # A tibble: 3 x 4
##   Year  Nominee                 Film                Result   
##   <chr> <chr>                   <chr>               <chr>    
## 1 1998  Fernanda Montenegro     Central Station     Nominated
## 2 2002  Salma Hayek             Frida               Nominated
## 3 2004  Catalina Sandino Moreno Maria Full of Grace Nominated

3.3 Asian Lead Actors and Actresses

## # A tibble: 3 x 4
##   Year  Name         Film                Status   
##   <chr> <chr>        <chr>               <chr>    
## 1 1956  Yul Brynner  The King and I      Won      
## 2 1971  Topol        Fiddler on the Roof Nominated
## 3 1982  Ben Kingsley Gandhi              Won
## # A tibble: 3 x 4
##   Year  Name         Film                     Status   
##   <chr> <chr>        <chr>                    <chr>    
## 1 1935  Merle Oberon The Dark Angel           Nominated
## 2 1939  Vivien Leigh Gone With The Wind       Won      
## 3 1951  Vivien Leigh A Streetcar Named Desire Won

3.4 Egyptian Lead Actor

##   Year       Name              Film Status
## 1 2018 Rami Malek Bohemian Rhapsody    Won

3.5 All Supporting Actors and Actresses

## # A tibble: 3 x 3
##    Year Actor          Film           
##   <dbl> <chr>          <chr>          
## 1  1936 Walter Brennan Come and Get It
## 2  1936 Mischa Auer    My Man Godfrey 
## 3  1936 Stuart Erwin   Pigskin Parade
## # A tibble: 3 x 3
##    Year Actress          Film              
##   <dbl> <chr>            <chr>             
## 1  1936 Gale Sondergaard Anthony Adverse   
## 2  1936 Beulah Bondi     The Gorgeous Hussy
## 3  1936 Alice Brady      My Man Godfrey

3.6 All Lead Actors and Actresses

## # A tibble: 3 x 3
##    Year Actor               Film                                
##   <dbl> <chr>               <chr>                               
## 1  1928 Emil Jannings       The Last CommandThe Way of All Flesh
## 2  1928 Richard Barthelmess The NooseThe Patent Leather Kid     
## 3  1929 Warner Baxter       In Old Arizona
## # A tibble: 3 x 3
##    Year Actress        Film                                               
##   <dbl> <chr>          <chr>                                              
## 1  1928 Janet Gaynor   7th HeavenStreet AngelSunrise: A Song of Two Humans
## 2  1928 Louise Dresser A Ship Comes In                                    
## 3  1928 Gloria Swanson Sadie Thompson

3.7 Highest Grossing Movies (1977 - 2021)

For the highest grossing movies from 1977 to 2021, we found a separate box office data set for each year and wrote for loops to scrape the Movie and the box office earnings from the row with the highest domestic gross, adding a Year variable for each. We also cleaned the box office figures. The resulting data set is a 45x3 data frame.

##       Overall_Movie Overall_Boxoffice Year
## 1         Star Wars         307263857 1977
## 2            Grease         159978870 1978
## 3 Kramer vs. Kramer         106260000 1979
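
The loop described above might look like the following sketch. It assumes each yearly table has already been read into a data frame; `yearly_box_office`, `clean_money`, and `highest_grossing_demo` are hypothetical names, and the movies and amounts are toy values rather than the scraped data.

```r
# Hypothetical per-year box office tables with toy values.
yearly_box_office <- list(
  "1977" = data.frame(Movie = c("Film A", "Film B"),
                      Gross = c("$1,200,000", "$3,450,000"),
                      stringsAsFactors = FALSE),
  "1978" = data.frame(Movie = c("Film C", "Film D"),
                      Gross = c("$9,000,000", "$2,500"),
                      stringsAsFactors = FALSE)
)

# Strip dollar signs and commas, then convert to numeric.
clean_money <- function(x) as.numeric(gsub("[$,]", "", x))

rows <- vector("list", length(yearly_box_office))
for (i in seq_along(yearly_box_office)) {
  tbl   <- yearly_box_office[[i]]
  gross <- clean_money(tbl$Gross)
  top   <- which.max(gross)  # row index of the highest domestic gross
  rows[[i]] <- data.frame(
    Overall_Movie     = tbl$Movie[top],
    Overall_Boxoffice = gross[top],
    Year              = as.numeric(names(yearly_box_office)[i]),
    stringsAsFactors  = FALSE
  )
}

# One row per year, three columns.
highest_grossing_demo <- do.call(rbind, rows)
print(highest_grossing_demo)
```

Cleaning the amounts before calling `which.max` matters: comparing the raw strings would sort them alphabetically rather than numerically.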

3.8 Box office earnings of Oscar Best Picture 1980 - 2021

For the box office earnings of Oscar Best Picture (1980-2021), we read in the table, selected the “Movie”, “Domestic Box Office”, and “Release Date” columns, renamed them appropriately, derived a new “Year” column from “Release Date”, and cleaned the box office figures by removing dollar signs and commas. The resulting data set is a 41x4 data frame.

## # A tibble: 3 x 4
##   Oscar_Movie      Oscar_Boxoffice `Release Date`  Year
##   <chr>                      <dbl> <chr>          <dbl>
## 1 Ordinary People         52302978 Sep 19, 1980    1980
## 2 Chariots of Fire        61558162 Sep 25, 1981    1981
## 3 Gandhi                  52767889 Dec 8, 1982     1982
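
The select, rename, and clean steps can be sketched in base R as below. This is an illustration under assumptions: `raw` and `oscar_bo_demo` are hypothetical names, the `Extra` column is a made-up placeholder for columns that get dropped, and the dollar-sign and comma formatting of the input strings is assumed from the description above.

```r
# Hypothetical raw table; "Extra" stands in for columns that are not kept.
raw <- data.frame(
  Movie                 = c("Ordinary People", "Gandhi"),
  `Domestic Box Office` = c("$52,302,978", "$52,767,889"),
  `Release Date`        = c("Sep 19, 1980", "Dec 8, 1982"),
  Extra                 = c("x", "y"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Keep only the three columns of interest and rename the first two.
oscar_bo_demo <- raw[, c("Movie", "Domestic Box Office", "Release Date")]
names(oscar_bo_demo)[1:2] <- c("Oscar_Movie", "Oscar_Boxoffice")

# Remove dollar signs and commas so the earnings become numeric.
oscar_bo_demo$Oscar_Boxoffice <-
  as.numeric(gsub("[$,]", "", oscar_bo_demo$Oscar_Boxoffice))

# Take the four-digit year at the end of the release date string.
oscar_bo_demo$Year <-
  as.numeric(sub(".*([0-9]{4})$", "\\1", oscar_bo_demo$`Release Date`))

print(oscar_bo_demo)
```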

3.9 Oscar Best Picture Winners with Female Lead

For the list of Oscar Best Picture winners with a female lead, we entered each Year and Movie into an Excel file with those two columns and read it into a data frame.

## # A tibble: 3 x 2
##    Year Movie                
##   <dbl> <chr>                
## 1  1935 It Happened One Night
## 2  1940 Gone With the Wind   
## 3  1941 Rebecca

3.10 Box office of Winners with female lead and the Nominees that year

For the box office data of the winners with a female lead and the nominees in those 16 years, we found 16 separate data sets, each covering the full box office of one year. For each year, we selected the rows for the winner and the nominees, added them to an Excel file, and read the file into a data frame.

## # A tibble: 3 x 4
##    Year Movie                 Boxoffice Status
##   <dbl> <chr>                     <dbl> <chr> 
## 1  1935 It Happened One Night       5.2 W     
## 2  1935 Cleopatra                   5.5 N     
## 3  1935 Les Misérables              3.4 N

3.11 IMDB Ratings of Oscar Best Picture Winners and Nominees

For the IMDB data on Oscar Best Picture winners and nominees (1939-2021), we scraped the Movie, Year, Genre, Running time, Rating, and the link to the movie poster image, and assembled them into a data frame.

##                         Movie Year                                   Genre
## 1                   Nomadland 2020                     \nDrama            
## 2                  The Father 2020            \nDrama, Mystery            
## 3 Judas and the Black Messiah 2021 \nBiography, Drama, History            
##      Time Rating
## 1 107 min    7.3
## 2  97 min    8.3
## 3 126 min    7.4
##                                                                                                                                                  Image
## 1 https://m.media-amazon.com/images/M/MV5BMDRiZWUxNmItNDU5Yy00ODNmLTk0M2ItZjQzZTA5OTJkZjkyXkEyXkFqcGdeQXVyMTkxNjUyNQ@@._V1_UX140_CR0,0,140,209_AL_.jpg
## 2 https://m.media-amazon.com/images/M/MV5BZGJhNWRiOWQtMjI4OS00ZjcxLTgwMTAtMzQ2ODkxY2JkOTVlXkEyXkFqcGdeQXVyMTkxNjUyNQ@@._V1_UY209_CR0,0,140,209_AL_.jpg
## 3 https://m.media-amazon.com/images/M/MV5BMjBjYjBjNTUtOTg0Ni00Yzk2LTg1NWMtNjI2ODk2YjJmZGU0XkEyXkFqcGdeQXVyODE5NzE3OTE@._V1_UX140_CR0,0,140,209_AL_.jpg

3.12 Number of Times Actors won an Oscar

## # A tibble: 3 x 2
##   Name           Wins
##   <chr>         <dbl>
## 1 Paul Muni         1
## 2 Fredric March     2
## 3 Gary Cooper       2

3.13 Number of Times Actresses won an Oscar

## # A tibble: 3 x 2
##   Name               Wins
##   <chr>             <dbl>
## 1 Norma Shearer         1
## 2 Irene Dunne           0
## 3 Katharine Hepburn     4

In total, we have 18 data sets: actor_lead, actress_lead, actor_sup, actress_sup, black_actor_lead, black_actress_lead, latin_actor_lead, latin_actress_lead, asian_actor_lead, asian_actress_lead, egyptian_actor_lead, highest_grossing, oscar_bo, female_winners, female_nom_bo, actor_wins, actress_wins, and imdb_oscar.